Abstract

A  speech  processing  system  is  designed  to  simulate  the  transmission  of  speech 
signals  using  a  speech  coding  scheme.  The  transmitter  portion  of  the  simulation 
extracts  a  minimized  set  of  frequencies  in  Fourier  space  which  represents  the  essence 
of  each  of  the  speech  timeslices.  These  parameters  are  then  adaptively  quantized 
and transmitted to a receiver portion of the coding scheme. The receiver then generates
an estimate of the original timeslice from the transmitted parameters using a
sinusoidal speech model. After initial design, the thesis investigates how each of the
design parameters affects the human-perceived quality of speech. This is done with
listening  tests.  The  listening  tests  consist  of  having  volunteers  listen  to  a  series  of 
speech  reconstructions.  Each  reconstruction  is  the  result  of  the  coding  scheme  acting 
on the same speech input file with the design parameters varied. The design parameters
that are varied are the number of frequencies used in the sinusoidal speech model
for reconstruction, the number of bits used to encode amplitude information, and the
number of bits used to encode phase information. The final design parameters for the coding
scheme were selected based on the results of the listening tests. Post-design listening
tests  showed  that  the  system  was  capable  of  4800  bps  speech  transmission  with  a 
quality  rating  of  five  on  a  scale  from  zero  (not  understandable)  to  ten  (sounds  just 
like  original  speech). 
RULE-BASED FREQUENCY DOMAIN
SPEECH CODING


I.  Introduction

The  search  for  more  robust  and  bandwidth  efficient  ways  to  transmit  speech 
signals  has  recently  led  researchers  into  the  speech  analysis/synthesis  area  (1,  2,  9). 
These  systems  have  an  advantage  over  the  established  analog  speech  transmission 
systems  in  the  area  of  noise  rejection;  therefore,  they  are  more  robust.  However, 
the  analog  transmission  techniques  have  the  more  modest  bandwidth  requirements 
of about 8 kHz for speech. Speech analysis/synthesis techniques usually must be
coupled with digital signal transmission systems that typically require a greater system
bandwidth to communicate the same information in digital format as compared
to an analog format (13:4). Therefore, the new analysis/synthesis coding schemes
must be able to code speech more efficiently than the analog coding systems or they
will not be a viable alternative to the analog systems. The next step
in the analysis/synthesis research area is thus to ensure that the coding schemes are as
bandwidth efficient as possible.

1.1  Background 

In  March  1989,  Nadeem  A.  Bashir  published  his  thesis  on  the  reduction  of 
noise  in  speech  signals.  In  his  thesis,  Bashir  reconstructed  speech  signals  by  first 
estimating the speaker’s glottal frequency, then transmitting the amount of energy
measured in frequency bands corresponding to exact harmonics of the glottal frequency
estimate. The problem was that, if exact harmonics of the glottal frequency were
chosen for reconstructing the speech, a noise that sounded like music playing in
the background corrupted the reconstructed speech (see the Literature Review for an
in-depth analysis of the Bashir system). To eliminate the “musical noise” effect,
he selected the highest energy frequency within the nearest four neighbors of the
harmonics of the glottal frequency estimate. He then reconstructed the speech from
this spectrum. After implementing this algorithm, Bashir was able to reconstruct
speech  signals  which  were  perceived  to  be  indistinguishable  from  the  original  speech 
by  human  listeners  (2).  The  ability  of  this  system  to  reconstruct  pristine  speech 
from  selected  frequency  components  made  it  a  candidate  for  research  as  a  possible 
midband  or  lowband  rate  speech  coding  system. 

In  March  1990,  Capt.  M.  F.  Alenquer  was  successful  in  adapting  the  noise 
reduction  scheme,  demonstrated  by  Bashir,  as  a  speech  coding  and  transmission 
scheme. He also began the search for the lowest possible data rate for the system.
The expression for the data rate of the system was found to be:

$$\text{DataRate} = \frac{2L}{T}(A + P + 8) \qquad (1)$$


Where: 

•  L  is  the  number  of  components  per  frame. 

•  A  is  the  number  of  bits  transmitted  to  convey  amplitude  information. 

•  P  is  the  number  of  bits  transmitted  to  convey  the  phase  information. 

•  T  is  the  frame  time  duration. 

Alenquer reduced the system data rate by minimizing the three parameters A, P,
and L through subjective listening tests. He was able to obtain speech reconstruction
data rates as low as 18 kbps (1).
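
As a worked illustration of Equation (1), the short sketch below evaluates the data rate for one set of parameter values. The function name `data_rate_eq1` and the values L = 20, A = 4, P = 4, and T = 32 ms are assumptions chosen only for illustration; they are not taken from Alenquer's results.

```python
def data_rate_eq1(L, A, P, T):
    """Evaluate Equation (1): DataRate = (2L/T)(A + P + 8), in bits per second.

    L: components per frame, A: amplitude bits, P: phase bits,
    T: frame duration in seconds. The constant 8 is copied from Equation (1).
    """
    return (2.0 * L / T) * (A + P + 8)


# Hypothetical parameter values, for illustration only.
rate = data_rate_eq1(L=20, A=4, P=4, T=0.032)
print(f"{rate:.0f} bps")
```

Because the rate is a product of L and (A + P + 8), reducing the number of components per frame and reducing the bits per component both lower the rate in direct proportion.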
1.2  Problem


The  objective  of  this  thesis  is  to  implement  a  speech  analysis/synthesis  coding 
system  based  on  the  work  of  Bashir  and  Alenquer  which  transmits  low  bit  rate  speech 
signals  at  near  toll  quality. 

1.3  Research Questions

The  following  is  a  list  of  research  questions  which  will  be  addressed  in  this 
thesis: 

1.  Can  the  system  be  modified  to  accept  a  lower  data  rate  by  transmitting  the 
glottal  frequency  (8  bits),  then  transmitting  the  relative  position  of  the  next 
selected  frequency  (2  bits)?  If  this  is  possible  the  data  rate  equation  will  change 
from  Equation  (1)  to: 

$$\text{DataRate} = \frac{2}{T}(A + P + 8) + \frac{2}{T}(L - 1)(A + P + 2), \qquad (2)$$

leading  to  a  substantial  data  rate  reduction. 

2.  How does the perceived reconstructed speech quality change as a function of
the  design  parameters? 

3.  Can  an  adaptive  quantization  scheme  be  developed  which  matches  itself  to  the 
expected  value  of  the  amplitudes  of  the  spectral  components? 
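
To make the saving proposed in the first research question concrete, the sketch below evaluates Equations (1) and (2) side by side. The function names and the parameter values (L = 20, A = 4, P = 4, T = 32 ms) are assumptions for illustration and are not results from this thesis.

```python
def data_rate_eq1(L, A, P, T):
    """Equation (1): every component carries an 8-bit frequency code."""
    return (2.0 / T) * L * (A + P + 8)


def data_rate_eq2(L, A, P, T):
    """Equation (2): the first component carries 8 frequency bits,
    each of the remaining L - 1 components carries a 2-bit relative position."""
    return (2.0 / T) * (A + P + 8) + (2.0 / T) * (L - 1) * (A + P + 2)


# Hypothetical parameter values, for illustration only.
L, A, P, T = 20, 4, 4, 0.032
before = data_rate_eq1(L, A, P, T)
after = data_rate_eq2(L, A, P, T)
print(f"Equation (1): {before:.0f} bps")
print(f"Equation (2): {after:.0f} bps")
print(f"Saving:       {before - after:.0f} bps")
```

The difference between the two expressions is (2/T)(L - 1)(6), that is, six bits saved on every component after the first.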

1.4  Assumptions 

The  following  assumptions  are  made  in  this  thesis: 

1.  As  described  by  Alenquer,  the  best  speech  reconstruction  results  occur  while 
using the four-nearest-neighbors frequency-selection algorithm (1:4-2).

2.  Alenquer’s  algorithm  performs  as  he  intended  and  produces  the  effect  on  speech 
as  described  in  his  thesis  (1). 
3.  Any noise which occurs when the system design parameters of the algorithm
are  varied  is  the  result  of  information  loss  and  not  the  result  of  Hamming 
window  leakage. 

1.5  Scope 

A  speech  transmission  system  is  designed  based  on  the  work  of  Alenquer  and 
Bashir  using  frequency  domain  analysis  techniques.  The  speech  is  reconstructed 
from the frequency, amplitude, and phase of selected spectral components (1:1-5).
The  spectrum  of  the  speech  is  obtained  by  taking  a  1024-point  DFT  of  time  framed 
speech.  The  signal  is  reconstructed  by  inverse  transforming  the  vectors  containing 
the  selected  frequency  components. 

1.6  Approach 

The first step of the thesis will be to design a speech coding system which
is based on a combination of the characteristics of Alenquer’s and Bashir’s work.
The speech processing system will be implemented in software to simulate the transmitter
and  receiver  in  a  digital  communications  link.  The  coding  system  will  exploit  the 
characteristics  of  the  speech  spectra  including: 

1.  Speech energy rolls off at 6 dB per octave above 600 Hz. (12)

2.  Speech  spectra  are  stationary  for  up  to  40  ms  time  periods.  (10) 

3.  The  basic  shape  of  the  spectra  remains  approximately  constant  for  all  voiced 
speech  signals,  only  the  locations  of  the  maxima  change. 

The  second  step  of  this  thesis  will  be  to  establish  a  high  fidelity  audio  tape 
containing  several  reproductions  of  an  utterance.  Each  reproduction  will  be  the 
result  of  the  speech  processing  system  operating  on  the  same  utterance,  but  with 
the  system  parameters  varied.  The  system  parameters  to  be  varied  are  components 
per  frame  (L),  bits  transmitted  for  amplitude  information  (A),  and  bits  transmitted 
for phase information (P). The tape will then be played for a number of listeners
who will be asked to rate each utterance on a scale from zero (could not understand
the  utterance)  to  ten  (sounded  just  like  original  speech).  This  information  will  be 
used  to  develop  a  family  of  curves  which  will  show  how  the  quality  of  reproduced 
speech  is  affected  by  varying  each  of  the  design  parameters  one  parameter  at  a  time. 

In  the  final  step  of  this  thesis,  a  “best  guess”  will  be  made  of  the  parameters  A, 
L,  and  P  which  will  minimize  the  data  rate  of  the  system  without  greatly  diminishing 
the  quality  of  speech  based  on  the  listening  test  results.  These  parameters  will  be 
implemented  in  the  speech  processing  system  final  design. 

1.7  Definitions 

1.  Glottis  is  the  opening  between  the  vocal  cords.  (7) 

2.  Voiced  sounds  are  sounds  produced  by  forcing  air  through  the  vocal  cords, 
thereby  producing  quasi-periodic  pulses  of  air  which  excite  the  vocal  tract. 
(11) 

3.  Unvoiced  or  Fricative  sounds  are  sounds  produced  by  forcing  air  through 
a  constriction  in  the  vocal  tract,  thereby  producing  turbulence.  The  resulting 
broad-spectrum  noise  excites  the  vocal  tract.  (11) 

4.  Formants  are  resonances  in  the  vocal  tract.  (7) 

5.  Glottal  or  Fundamental  Frequency  is  the  rate  at  which  the  glottis  opens 
and closes during voiced speech. (7)

1.8  Sequence  of  Presentation 

Chapter  II  presents  a  literature  review  of  recent  work  in  the  area  of  speech 
transmission  systems  at  both  AFIT  and  MIT  Lincoln  Laboratory. 

Chapter  III  gives  a  description  of  the  speech  reconstruction  processing  system 
hardware  and  software  environment. 
Chapter IV describes the algorithm.

Chapter V describes the system testing.

Chapter VI presents the results of the tests.

Chapter VII gives the final
conclusions and recommendations.

The  Appendices  contain  sample  data  and  the  computer  program  source  code. 
II.  Literature Review


2.1  Introduction 

Since  the  beginning  of  the  use  of  electrical  signaling  techniques,  one  of  the 
questions  which  has  concerned  the  telecommunications  engineer  is  how  to  send  more 
information on the limited resources available. Cherry describes one telecommunications
engineer who found a way to better use existing resources. His name was
Samuel Morse. After inventing the telegraph, Morse analyzed the use of the alphabet
to determine which letters were used the most. He then assigned the shortest
codes to the letters most often used, thus maximizing the use of the limited telegraph
resources (3:37). In 1928, Hartley showed that transmitting a “quantity of information”
in a given amount of time required a certain minimum amount of bandwidth
(3:43,44).  With  this  discovery,  the  telecommunications  engineer  was  doomed  forever 
to  consider  the  tradeoff  between  time  of  transmission  and  channel  bandwidth.  The 
quest  to  find  more  ingenious  ways  to  use  the  resources  available  intensified. 

One  area  of  particular  interest  for  maximizing  the  use  of  resources  is  the  speech 
transmission  area.  Cherry  writes: 


...in  order  to  obtain  more  economic  transmission  of  speech  signals, 
in view of the bandwidth × time law, something drastic had to be done
to  the  speech  signals  themselves  to  remove  those  elements  which  do  not 
contribute  markedly  to  the  speech  intelligibility.  These  considerations 
led  to  what  is  known  as  the  vocoder,  an  instrument  for  analyzing,  and 
subsequently  resynthesizing,  speech;  a  “talking  machine”  which  requires 
only  control  signals  to  be  transmitted  and  received  in  order  to  reproduce 
intelligible speech (3:45).


Vocoders  have  been  refined  and  made  more  efficient  since  Cherry  wrote  this  passage. 
However,  the  search  for  the  most  efficient  means  to  transmit  voice  signals  continues 
because today’s communications engineers are faced with the same dilemma as their
predecessors: find better ways to use existing transmission resources or go through
the  costly  and  time  consuming  effort  of  installing  new  systems. 

This  literature  review  will  examine  some  of  the  important  factors  involved  in 
efficient speech encoding. Section 2.2 will describe how voice coders are categorized
into one of two broad categories: waveform coders or synthesis/analysis coders.
Section 2.3 will cover the spectral properties of speech which make it possible to
code speech for transmission by selecting a limited number of parameters for
transmission to a receiver. Section 2.4 will review recent research efforts in the field
of  sinusoidal  encoding  of  speech.  Finally,  the  chapter  will  conclude  with  a  brief 
summary  of  the  Literature  Review. 

2.2  Speech  Coding 

Speech coding systems can be categorized as either waveform coders or synthesis/analysis
coders (13:649). Waveform coders estimate the input signal amplitude
during a given time interval. The coder then transmits the estimate to the decoder at
the receiving end of the system and an estimate of the original signal is reproduced.
Examples of waveform encoders include Pulse Code Modulation (PCM) coders and
Differential Pulse Code Modulation (DPCM) coders (1:1-2).

Synthesis/analysis  coders  approach  signal  estimation  in  a  different  way.  In 
this scheme, the characteristics of the power spectra of the signal to be transmitted
are  exploited.  If  the  power  spectrum  of  the  signal  is  constant  over  a  short  period 
of  time,  the  coder  can  select  the  characteristics  of  the  short-term  spectra  which 
contain  the  speech  information.  These  characteristics  may  then  be  transmitted  to 
the  decoder  at  the  receiving  end  of  the  system  and  the  signal  is  reproduced  from 
its spectra. Systems which use synthesis/analysis to transmit speech signals are
called vocoders (13:649).

The first vocoder was described by H. Dudley of Bell Labs and was demonstrated
at the 1939 New York World’s Fair (13:651). The channel coder consisted of a
bank of narrowband filters, detectors, and averagers, which form a local estimate of
the  power  distribution  over  the  frequency  band  (13:651,652).  The  speech  was  then 
reconstructed  from  the  estimates  of  the  power  distribution. 

2.3  Spectral  Properties  of  Speech 

As  described  by  Cherry,  voiced  speech  may  be  modeled  as  a  two  part  process. 
In  the  initial  part  of  the  process,  the  vocal  cords  generate  a  steady  tone.  The  tone 
is  periodic  and  therefore  has  a  periodic  Fourier  (line)  spectrum  of  harmonics.  The 
frequency  spacing  between  the  lines  in  the  spectra  depends  on  the  speaker.  Males 
have  a  lower  frequency  voice;  thus,  the  line  spacing  is  closer  than  in  a  female  voice. 
In  the  second  part  of  the  process,  the  tone  generated  by  the  vocal  cords  is  acted 
upon  by  the  resonances  of  the  vocal  tract.  These  resonances  vary  depending  on  the 
voiced  sound  being  generated.  The  frequencies  where  the  resonances  occur  are  called 
formants  (3:155,156). 

When  generating  unvoiced  speech,  humans  do  not  use  the  larynx  to  generate 
a  glottal  tone.  Therefore,  unvoiced  speech  does  not  have  the  periodicity  of  voiced 
speech. The energy spectrum of unvoiced speech closely resembles the uniformly distributed
spectrum of noise. An example of unvoiced speech is the “s” sound (5).

Figure 1a shows the “selective characteristics” of the vocal tract for the vowels
“u” and “i”. Cherry obtained these characteristics by measuring the response of
the vocal tract as the vowel sounds were being whispered. While whispering, the
airstream sets up an acoustic spectrum of uniform energy; thus, the response is due
solely to the selective characteristics of the vocal tract (3:155). Figure 1b shows the
approximate spectrum of the larynx and Figure 1c shows the output spectrum of
the voiced vowel “u”.

As  seen  in  Figure  1,  each  vowel  has  a  distinct  vocal  tract  frequency  response. 
This concept has been extended to words and sentences. That is, the vocal tract
has a distinct frequency response that varies in time. That this is
true may be demonstrated with spectrograms. Spectrograms are representations of the
energy spectra of speech for discrete periods of time (3:146). A sample spectrogram is
shown in Figure 2. One vertical slice of the spectrogram in Figure 2 would represent
the energy spectrum of Figure 1. It is in the shape of the frequency response of the
vocal tract that the information lies (5).

Figure 1.  Spectra of Sustained Vowels (Male Voice) (3:156). (Panel labels: waveform of larynx source energy and its approximate spectrum; vowel [u].)

Figure 2.  A “Visible Speech” Spectrogram (Sonogram) (3:148). (Panel label: the English rolled (r); amplitude in decibels.)

Sundberg  states: 

Moving the articulatory organs is what we do when we speak and sing;
in effect we chew the standing waves of our formants to change their
frequencies. Each articulatory configuration corresponds to a set of formant
frequencies,  which  in  turn  is  associated  with  a  particular  sound.  (14:84) 

Thus,  we  have  information  lying  in  the  continuously  changing  frequency  response  of 
the  human  vocal  tract.  The  spectral  properties  of  this  speech  appear  to  be  stationary 
for  20  to  50  ms  time  intervals  (13:649).  There  is  an  inherent  redundancy  in  speech 
which  allows  the  listener  to  understand  mutilated  speech  or  speech  corrupted  by 
noise  (2:1-1).  The  problem  that  must  now  be  solved  is  how  to  select  characteristics 
of speech for transmission which carry the necessary information, but are not
redundant  (5).  This  reduced  set  of  information  may  then  be  sent  on  the  transmission 
facilities,  thus  maximizing  the  use  of  the  voice  transmission  facilities. 

2.4  Sinusoidal  Encoding  of  Speech 

One  model  which  exploits  the  characteristics  of  the  spectrum  of  speech  is  the 
sinusoidal speech model. According to McAulay and Quatieri, this model is characterized
by speech that is assumed to be the result of passing the glottal excitation
input  through  a  linear  time  invariant  filter  which  models  the  response  of  the  vocal 
tract.  Thus,  the  output  speech  waveform  s(t)  may  be  estimated  by  the  well  known 
linear time invariant system model:

$$s(t) = h(t) * e(t) = \int h(t - \tau)\, e(\tau)\, d\tau$$

where h(t) is the impulse response of the vocal tract and e(t) is the waveform generated
by the glottal excitation (9:744,745). The sinusoidal speech model has been
proven to be effective in reproducing speech which is almost indistinguishable from
the original speech (1, 2, 9).

Figure 3.  Peak Selection in McAulay and Quatieri System (9:747)

Using  the  sinusoidal  speech  model,  there  are  several  methods  of  choosing  which 
frequencies in the speech spectrogram carry the necessary information. McAulay
and Quatieri generated spectrograms for intervals of speech 10 ms long. They then
extracted the amplitudes, phases, and frequencies of the sinusoidal components of
each interval of speech by using a 512-point Fast Fourier Transform (FFT) and an
adaptive Hamming window of 2.5 times the average pitch period. Frequency selection was
accomplished by choosing the peaks within the spectrogram as illustrated in Figure
3.  This selection technique was used for both voiced and unvoiced speech (9).
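
The peak selection idea can be sketched as follows. This is only an illustration under stated assumptions, not McAulay and Quatieri's implementation: the helper names `magnitude_spectrum` and `pick_peaks`, the relative floor of 10 percent of the largest magnitude, the 64-point frame, and the absence of a window are all hypothetical choices made to keep the example small.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of a real frame (direct O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for b in range(n // 2 + 1):
        acc = sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        spectrum.append(abs(acc))
    return spectrum


def pick_peaks(mags, floor_ratio=0.1):
    """Return bins that are local maxima and exceed floor_ratio * max(mags).

    The floor (an assumption of this sketch) keeps numerical noise from
    being reported as peaks.
    """
    floor = floor_ratio * max(mags)
    return [b for b in range(1, len(mags) - 1)
            if mags[b] > floor and mags[b] > mags[b - 1] and mags[b] >= mags[b + 1]]


# Synthetic frame: two sinusoids centred on bins 5 and 12 of a 64-point DFT.
N = 64
frame = [math.cos(2 * math.pi * 5 * k / N) + 0.5 * math.cos(2 * math.pi * 12 * k / N)
         for k in range(N)]
peaks = pick_peaks(magnitude_spectrum(frame))
print(peaks)  # bins of the two sinusoids: [5, 12]
```

Each selected bin b corresponds to the frequency b times the sample rate divided by N; the amplitude and phase of that bin would then be the parameters passed on for coding.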

McAulay and Quatieri also used a sinusoidal birth/death philosophy for
frame-to-frame peak matching. In this system, a frequency in frame k is compared
to all frequencies in frame k + 1 in an attempt to find a frequency in frame k + 1
within a “matching interval” Δ. If a match is found, then the frequency continues to
be a selected frequency peak. If more than one match is found, a tiebreaking system
defines which frequency is the best match. If a match is not found, the frequency
is declared dead. After all frequencies from frame k are matched to the frequencies
in frame k + 1, or are declared dead, any left over frequencies in frame k + 1 are
declared born (9:748).
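
A simplified sketch of the birth/death matching is given below. It is an assumption-laden illustration rather than the published procedure: the function name `match_frames` is hypothetical, and ties are broken here simply by choosing the closest unclaimed frequency, which may differ from the tiebreaking system McAulay and Quatieri describe.

```python
def match_frames(prev_freqs, next_freqs, delta):
    """Match frequencies of frame k to frame k + 1 within a matching interval.

    Returns (continued, dead, born):
      continued: list of (frequency in frame k, matched frequency in frame k + 1)
      dead:      frequencies of frame k with no match within delta
      born:      frequencies of frame k + 1 left unmatched
    """
    continued, dead = [], []
    unclaimed = list(next_freqs)
    for f in prev_freqs:
        candidates = [g for g in unclaimed if abs(g - f) <= delta]
        if candidates:
            # Simplified tiebreak (assumption): take the nearest candidate.
            best = min(candidates, key=lambda g: abs(g - f))
            continued.append((f, best))
            unclaimed.remove(best)
        else:
            dead.append(f)
    return continued, dead, unclaimed


# Hypothetical frequencies in Hz, for illustration only.
continued, dead, born = match_frames([200, 410, 900], [205, 400, 1500], delta=20)
print(continued)  # [(200, 205), (410, 400)]
print(dead)       # [900]
print(born)       # [1500]
```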

Alenquer and Bashir chose to use two different frequency selection techniques:
one for voiced and another for unvoiced speech. Initially, the speech was digitized
using an analog to digital converter. This data was then Hamming windowed and
a 512-point DFT taken. Frequency selection for voiced speech was based on an
estimate of the speaker’s glottal frequency. The rule-based selection began by finding
the  highest  energy  frequency  in  a  region  where  the  glottal  frequency  was  expected 
to  be.  Subsequent  frequencies  were  chosen  by  selecting  the  highest  energy  frequency 
within four nearest neighbors of harmonics of the glottal frequency estimate. Typical
spectrograms of speech before and after frequency selection are shown in Figure
4.  Unvoiced  speech  frequency  selection  was  accomplished  by  choosing  all  frequencies 
above  a  variable  energy  threshold.  The  threshold  was  varied  up  and  down  until  the 
desired  number  of  frequencies  were  selected.  The  threshold  also  was  attenuated  6 
dB  per  octave  above  625  Hz  to  compensate  for  the  fact  that  energy  in  speech  falls 
off  at  this  rate  (1,  2). 
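
The voiced-speech rule described above can be sketched as follows. The sketch rests on assumptions that the text does not settle: "four nearest neighbors" is read here as two DFT bins on either side of each harmonic (with the harmonic bin itself included), and the function names, search region, and magnitude values are hypothetical.

```python
def estimate_glottal_bin(mags, lo, hi):
    """Highest-energy bin in the region [lo, hi] where the glottal
    frequency is expected."""
    return max(range(lo, hi + 1), key=lambda b: mags[b])


def select_harmonic_bins(mags, f0_bin, n_components, half_width=2):
    """For each harmonic of f0_bin, pick the highest-energy bin among the
    harmonic and its nearest neighbors (half_width bins on each side)."""
    selected = []
    for h in range(1, n_components + 1):
        center = h * f0_bin
        if center >= len(mags):
            break  # harmonic lies beyond the available spectrum
        lo = max(center - half_width, 0)
        hi = min(center + half_width, len(mags) - 1)
        selected.append(max(range(lo, hi + 1), key=lambda b: mags[b]))
    return selected


# Hypothetical magnitude spectrum with 40 bins, for illustration only.
mags = [0.0] * 40
mags[10], mags[20], mags[21], mags[28], mags[33] = 5.0, 1.0, 4.0, 3.0, 9.0

f0 = estimate_glottal_bin(mags, lo=8, hi=12)
bins = select_harmonic_bins(mags, f0, n_components=3)
print(f0, bins)  # 10 [10, 21, 28]
```

Note that the strong component at bin 33 is not selected, because it lies outside the neighborhood of every harmonic considered; this is the sense in which the selection is rule based rather than purely peak based.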

2.5  Summary 

This  chapter  has  shown  that  speech  coding  may  be  done  by  several  different 
methods. These methods can be categorized as either analysis/synthesis coders
or waveform coders. Coders which use the spectral content of speech to perform
analysis/synthesis coding are called vocoders. Speech spectrum analysis can be
performed in many ways. McAulay and Quatieri chose to select the spectral peaks
for speech analysis. Alternatively, Alenquer and Bashir chose to select frequencies
which were neighbors of harmonics of the glottal frequency. Chapter III will
explain the experiment environment for this thesis.


III.  Development  Environment 


3.1  Introduction 

This chapter describes the hardware and software environment used in the
development  of  the  speech  processing  system.  The  first  section  describes  the  setup 
used  for  speech  recording  and  processing.  The  software  development  environment 
is  explained  in  the  following  section.  The  final  section  shows  the  environment  for 
playing  back  the  reconstructed  and  original  speech  files.  A  block  diagram  of  the 
environment  is  included  in  the  final  section. 

3.2  Speech  Recording  and  Processing 

Analog speech signals were input to the system through a noise reducing microphone.
A Rockland Model 852 filter set to low pass filter at 6000 Hz was placed
between the microphone and the Digital Sound Corporation Analog to Digital (A/D)
Converter.  The  6000  Hz  setting  prevented  aliasing  at  the  16,000  samples  per  second 
sampling  rate.  The  A/D  converter  digitized  the  analog  signal  to  integer  format  from 
-32,768 to 32,767. The data was then stored in a VAXstation II for input to the
speech  processing  system. 

3.3  Software  Environment 

The speech processing system was written in the C programming language on a
VAXstation II running the VMS version 5.1 operating system. The program
was modularly designed for ease of program redesign and intermediate data extraction.
Some computer code Alenquer and Barmore wrote for their thesis efforts was
modified and used in the speech processing system. The computer code is presented
in Appendix B.
3.4  Speech Playback


After  the  speech  files  were  run  on  the  system  the  reproduced  files  were  played 
back  for  listening  tests.  The  speech  files  stored  on  the  VAXstation  II  were  input 
to  the  DSC  200  Digital  to  Analog  (D/A)  Converter  for  playback.  The  playback 
signal  was  routed  through  the  Rockland  filter.  The  filter  was  set  to  low  pass  filter  at 
4000  Hz  because  the  final  design  of  the  speech  processing  system  does  not  output  any 
frequencies  above  4000  Hz.  The  output  was  played  on  either  a  speaker  or  headphones. 
IV.  Speech Processing System


4.1  Introduction

The  speech  processing  system  is  a  computer  program  designed  to  simulate 
the  transmission  of  speech  signals  at  low  data  rates.  The  system  simulates  the 
transmitter  and  receiver  but  not  the  effects  of  the  transmission  channel.  The  system 
was designed in three major modules: analysis, coding, and synthesis. The analysis
module  begins  by  breaking  the  speech  signal  into  many  small  time  slice  signals. 
The  time  slices  are  then  Fourier  transformed  and  analyzed  to  extract  a  minimized 
amount  of  information  in  the  frequency  domain.  The  coding  module  then  quantizes 
the  information  extracted  by  the  analysis  module  for  digital  signal  transmission.  The 
final  module,  synthesis,  then  regenerates  an  estimate  of  the  original  speech  signal 
from  the  reduced  set  of  information. 

4.1.1  Time Frame Generation.  Sample rates of 16 thousand samples per
second  and  8  thousand  samples  per  second  were  tried  in  the  development  of  the 
system.  Most  speech  spectral  energy  is  generally  considered  to  exist  at  frequencies 
below  4  kHz.  However,  since  the  system  was  evaluated  on  its  ability  to  reproduce 
quality  speech,  as  much  of  the  original  speech  energy  as  possible  was  desired  to 
ensure  all  important  spectral  components  were  passed.  Therefore,  energy  spectral 
components  above  4  kHz  were  passed.  As  expected,  aliasing  occurred  when  spectral 
components  above  4  kHz  were  sampled  at  8,000  samples  per  second.  Therefore,  the 
16  thousand  samples  per  second  sample  rate  was  used.  In  addition,  the  input  speech 
signals were low pass filtered at 6 kHz to prevent any possibility of aliasing when
using  the  higher  sample  rate. 

Time  slices  of  30  to  40  ms  were  desired  because  speech  spectral  components  are 
generally  stationary  for  this  period  of  time  (5).  In  order  to  obtain  these  time  slice 
periods at a 16 thousand samples per second rate, 512 to 640 samples were taken to
be one time slice. Each time slice therefore represented 32 to 40 ms of speech. An
example of a 512 sample time frame is shown in Figure 6.

Figure 6.  Time Slice Consisting of 512 Sample Points
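
The arithmetic behind these figures can be checked with a few lines. The function names below are hypothetical; the sample rate, filter cutoff, and slice lengths are the values stated in the text.

```python
def slice_duration_ms(num_samples, sample_rate_hz):
    """Duration in milliseconds of a time slice of num_samples samples."""
    return 1000.0 * num_samples / sample_rate_hz


def is_alias_free(cutoff_hz, sample_rate_hz):
    """True when the low pass cutoff lies below half the sample rate."""
    return cutoff_hz < sample_rate_hz / 2.0


SAMPLE_RATE_HZ = 16000

print(slice_duration_ms(512, SAMPLE_RATE_HZ))  # 32.0
print(slice_duration_ms(640, SAMPLE_RATE_HZ))  # 40.0
print(is_alias_free(6000, SAMPLE_RATE_HZ))     # True: 6 kHz is below 8 kHz
print(is_alias_free(6000, 8000))               # False: 6 kHz exceeds 4 kHz
```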

4.1.2  Windowing.  Each time slice was windowed both at the input to the
system  and  at  the  output  of  the  system.  The  input  windowing  was  accomplished  to 
prepare  the  data  for  the  Fast  Fourier  Transform  (FFT).  Windowing  before  the  FFT 
helps  reduce  side  lobe  spectral  leakage  which  is  due  to  the  finite  observation  interval. 
The  window  chosen  for  the  input  was  the  sine  window  function  seen  in  Figure  7. 

$$W(n) = \sin(\pi n / N) \qquad (3)$$

Where:

n = 0, 1, 2, ..., N - 1

Each time slice was also windowed at the output. This window was needed
because the system is a non-linear rule-based system; therefore, each time slice is
acted upon differently by the system. This causes amplitude discontinuities when the
time slices are added back together. Without output windowing, the discontinuities
at the time slice additions caused a 60 Hz “purring noise” in the reconstructed speech.
The 60 Hz noise corresponds approximately to the 16 ms interval (62.5 Hz) at which the time
slices were added together. The window function gradually adds in each new time
slice, thereby smoothing the discontinuities.

Figure 7.  Sine Window Function with 512 Points

The original window function chosen for the system was the Hamming window:

$$W(n) = 0.54 - 0.46\cos(2\pi n / N) \qquad (4)$$

Where:

n = 0, 1, 2, ..., N - 1

The  Hamming  window,  shown  in  Figure  8,  had  the  desired  effect  of  reducing 
the  spectral  leakage  when  used  at  the  input  to  the  system.  However,  when  this 
window function is used as the window function for the input and output, the reproduced
speech was modulated with a 60 Hz sinusoid. This effect is explained by the
calculation shown in Figure 9. The calculation represents the overall effect of the
windowing scheme. The window scheme consists of the window function squared,
due to the multiplication at the input and output, and added to itself with a 50
percent overlap (50 percent overlap is explained in the next section). As shown, the
resulting function is a 60 Hz sinusoid. Because of the distortion introduced by the
Hamming window, a different window was chosen whose overall window function did not
add noise to the system, namely the sine window.

Figure 8.  Hamming Window Function with 512 Points

Figure 9.  Overall Hamming Window Scheme Function

The overall effect of the sine window scheme on the system is multiplication by
unity: the squared sine window added to itself with a 50 percent overlap sums to
sin^2 + cos^2 = 1. The sine window effect was verified by reconstructing speech
with all frequency components. The speech was indistinguishable from the original
recorded speech.
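
The window scheme calculation described above can be checked numerically. The
following is a minimal sketch, assuming 512-sample time slices, a 16 kHz sample
rate, and a hop of half a slice; the function names are illustrative and are not
taken from the thesis software.

```python
import math

N = 512          # samples per time slice (32 ms at 16 kHz, assumed)
HOP = N // 2     # 50 percent overlap
FS = 16000.0     # sample rate in Hz (assumed)

def sine_window(n):
    return math.sin(math.pi * n / N)

def hamming_window(n):
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / N)

def overall_scheme(window):
    """Window applied at input and output (hence squared), summed with the
    copy from the neighbouring slice, which is shifted by half a slice."""
    return [window(n) ** 2 + window(n + HOP) ** 2 for n in range(HOP)]

sine_total = overall_scheme(sine_window)
hamming_total = overall_scheme(hamming_window)

# Peak-to-peak variation of the overall gain across one hop.
sine_ripple = max(sine_total) - min(sine_total)
hamming_ripple = max(hamming_total) - min(hamming_total)

# The gain pattern repeats once per hop, so this is the modulation rate.
ripple_rate_hz = FS / HOP
```

Under these assumptions the sine window scheme has essentially no ripple, while
the Hamming scheme leaves a sinusoidal gain variation that repeats every 16 ms,
which is the modulation of about 60 Hz noted in the text.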

4.1.3 Frame Overlapping  A 50 percent frame overlap was used in the
speech analysis. The overlap is needed because of the characteristics of the
windowing function. Since the window function is near zero at the endpoints, data at
the endpoints is ignored if some type of overlapping scheme is not used. The most
common overlap schemes are 33 percent and 50 percent. Since the data rate of the
system is inversely proportional to the overlap, the 50 percent overlap was used to
help minimize the data rate.

4.1.4 Fast Fourier Transform  After the timeframes were generated,
overlapped, and windowed, they were input into a Fast Fourier Transform (FFT) to
obtain the spectrogram of the time slice. During design, 256-, 512-, and 1024-point
FFTs were tried in the system. Sample rate, number of samples per frame, and FFT size
interact throughout the system. According to Harris (4), FFT bin size is defined to
be:


$$ \Delta f = \alpha \left( \frac{f_s}{N} \right) \tag{5} $$

Where:

• $\Delta f$ is the FFT bin width.

• $\alpha$ is a coefficient reflecting the bandwidth increase due to windowing.

• $f_s$ is the sample rate.

• $N$ is the FFT size.

Figure 10. Windowed Timeslice with Zero Stuffing

The system goal was detection and estimation of important spectral
components of speech. Therefore, a narrow bin width was desired. In order to shrink the
bin size, a large FFT size was needed. However, the 32 to 40 ms time slice durations
and the 16,000 samples per second sample rate fixed the number of samples per time
slice between 512 and 640. Thus, either the FFT size had to be 512 or zero padding
had to be used.

As  discussed  by  Marple,  zero  padding  causes  the  discrete-time  Fourier  series 
to  interpolate  transform  values  between  the  original  N  values.  Thus,  not  only  does 
zero  stuffing  match  the  transform  size  to  the  number  of  input  values,  it  also  gives 
an “apparent” increase in transform resolution (8). In addition, subjective listening
tests  showed  that  the  1024  point  FFT  performed  best  at  reconstructing  the  speech. 
Therefore  the  1024  point  FFT  was  chosen  for  the  system.  An  example  of  a  sine 
windowed  and  zero  padded  input  to  the  FFT  is  shown  in  Figure  10. 
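
The bin width relation and the interpolating effect of zero padding can be
illustrated with a short sketch. It assumes the window coefficient is left at 1
and uses a deliberately small 16-sample frame with a direct (slow) transform in
place of the 512-sample slices and FFT used in the thesis; all names are
illustrative.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform, adequate for a short demonstration."""
    n_pts = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n_pts)
                for t in range(n_pts))
            for k in range(n_pts)]

def bin_width_hz(sample_rate_hz, fft_size, alpha=1.0):
    """Bin width from sample rate and FFT size; alpha = 1 is an assumption."""
    return alpha * sample_rate_hz / fft_size

# A short sine-windowed frame standing in for a full time slice.
n_samples = 16
frame = [math.sin(math.pi * n / n_samples)
         * math.cos(2.0 * math.pi * 3.3 * n / n_samples)
         for n in range(n_samples)]

padded = frame + [0.0] * n_samples     # zero stuff to twice the length
spectrum = dft(frame)
padded_spectrum = dft(padded)

# Even-numbered bins of the padded transform reproduce the original bins;
# the odd-numbered bins are interpolated values in between.
max_diff = max(abs(padded_spectrum[2 * k] - spectrum[k])
               for k in range(n_samples))
```

With a 16 kHz sample rate, `bin_width_hz(16000, 512)` gives 31.25 Hz and
`bin_width_hz(16000, 1024)` gives 15.625 Hz, which is the bin width quoted later
in the frequency coding section.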

4.1.5 Frequency Selecting  Bashir reconstructed speech by estimating the
glottal frequency, then selecting all harmonics of the glottal frequency for use in
reconstruction. He found that using exact harmonics of the glottal frequency caused
a “musical noise” to corrupt the reconstructed speech. In order to eliminate this
effect, he used a “nearest neighbor” rule to prevent exact harmonics of the glottal
frequency from always being selected for reconstruction. The “nearest neighbor” rule
consists of estimating the glottal frequency as before, then searching the N nearest
neighbors of each glottal harmonic to see which frequency component has the highest
energy. The highest energy frequency component in the area is then selected for
reconstruction of the speech. Bashir was able to reconstruct speech which was
indistinguishable from the original recorded speech using this selection scheme (2).

Alenquer initially selected the same components for speech reconstruction as
Bashir by using the nearest neighbor rule. Alenquer’s goal was to code speech with
as low a data rate as possible. Therefore, his next step was to eliminate as many of
the selected components as possible without severely distorting the speech. Alenquer
chose to send the first N harmonics of the glottal frequency chosen using the nearest
neighbor rule, N being the number of components sent per time slice. Alenquer
found that the system performed best using a four nearest neighbor rule. He was
able to reconstruct speech using 16 frequency components and an 18 kbps data rate (1).

As Alenquer did, this thesis began by pruning the frequency components to a
reduced set, which would later be used to select a smaller number of components for
transmission. Many different frequency component selection schemes were tried in the
system, including:

• Choose greatest energy frequencies in the spectrogram.

• Choose greatest energy frequencies from a low pass filtered (smoothed)
spectrogram.

• Approximate the glottal pitch; then choose glottal pitch harmonics using the four
nearest neighbor rule.

• Approximate the glottal pitch; then choose highest energy glottal pitch
harmonics using the nearest neighbor rule and using compensation factors to
account for the 6 dB per octave rolloff in speech energy above 600 Hz.

Limited  research  time  necessitated  the  choice  of  one  selection  scheme  for  research  in 
the  system.  Again,  preliminary  subjective  listening  tests  were  used  to  select  which 
scheme  performed  the  best.  The  selection  scheme  using  the  compensation  factor 
performed  best  and  was  selected  for  the  system. 

The 6 dB per octave decrease in energy in speech was compensated for by
generating a decision vector. The decision vector was generated by initially copying
the amplitude spectra vector into the decision vector. The vector was then
multiplied by a compensation vector, shown in Figure 11, which consisted of components
increasing exponentially at 6 dB per octave. The functions used to generate the
compensation vector are:

$$ c(n) = \begin{cases} 1, & n = 0, 1, \ldots, 37 \\ e^{0.0355(n-38)}, & n = 38, 39, \ldots, 76 \\ 4e^{0.018(n-77)}, & n = 77, 78, \ldots, 154 \\ 16e^{0.009(n-154)}, & n = 155, 156, \ldots, 256 \end{cases} \tag{6} $$

Figure 11. Compensation Vector to Compensate for Energy Drop-off in Speech
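
The compensation and decision vectors can be sketched as follows. The exponent
constants are read here as 0.0355, 0.018 and 0.009 with positive sign, which is
an interpretation consistent with the statement that the components increase
with frequency; the function names and the 257-bin vector length are
illustrative assumptions.

```python
import math

def compensation(n):
    """Piecewise gain for FFT bin n (constants as read from Equation 6)."""
    if n <= 37:
        return 1.0
    if n <= 76:
        return math.exp(0.0355 * (n - 38))
    if n <= 154:
        return 4.0 * math.exp(0.018 * (n - 77))
    return 16.0 * math.exp(0.009 * (n - 154))

BIN_WIDTH_HZ = 16000.0 / 1024      # 15.625 Hz per bin for a 1024-point FFT

# One gain per bin, bins 0 through 256.
comp_vector = [compensation(n) for n in range(257)]

def decision_vector(amplitude_spectrum):
    """Copy of the amplitude spectrum weighted by the compensation vector."""
    return [a * c for a, c in zip(amplitude_spectrum, comp_vector)]
```

With these constants the gain is 1 up to bin 37 (about 600 Hz), reaches 4 at
bin 77 (about 1200 Hz), and is close to 16 at bin 154 (about 2400 Hz), so each
octave multiplies the gain by roughly four.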

An additional problem was found while experimenting with selection schemes:
strong formants were usually several harmonics wide. Therefore, many adjacent
components were selected to represent the same formant, causing the spectra to be
inadequately sampled. This problem was greatly reduced by the use of a
neighborhood inhibiting scheme. The neighborhood inhibiting scheme consists of
selecting a component from the reduced frequency set using the decision vector, then
multiplying the four nearest neighbors of the selected frequency by 0.75 to inhibit
them from being selected. Examples of spectra and frequencies selected for voiced
and fricative speech are shown in Figures 12, 13, 14 and 15.
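
A minimal sketch of the neighborhood inhibiting idea is given below. It assumes
that the "four nearest neighbors" are the two entries on either side of the
selected one in the decision vector and that selection is greedy by decision
value; both are interpretations, and the function name is hypothetical.

```python
def select_with_inhibition(decision, count, inhibit=0.75, reach=2):
    """Pick `count` entries of `decision` greedily by value.

    After each pick, the `reach` entries on either side of it (four
    neighbours for reach=2) are scaled by `inhibit`. Returns the chosen
    indices in the order they were picked.
    """
    work = list(decision)            # leave the caller's vector untouched
    chosen = []
    for _ in range(min(count, len(work))):
        best = max((i for i in range(len(work)) if i not in chosen),
                   key=lambda i: work[i])
        chosen.append(best)
        for offset in range(-reach, reach + 1):
            j = best + offset
            if offset != 0 and 0 <= j < len(work):
                work[j] *= inhibit
    return chosen

# One broad peak (indices 2..5) and one narrower, weaker peak (index 9).
decision = [1, 2, 9, 10, 9.5, 9, 2, 1, 3, 8, 3, 1]
picked = select_with_inhibition(decision, 2)
plain = sorted(range(len(decision)),
               key=lambda i: decision[i], reverse=True)[:2]
```

In this toy example a plain "two largest" rule takes both components from the
broad peak, whereas the inhibited selection takes one from each peak, which is
the behavior the scheme is meant to encourage.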

4.2 Coding System

The second portion of the speech processing system is the coding module. The
components which are to be coded are:

1. The frequency of the selected components.

2. The amplitude of the transmitted frequency components.

3. The phase of the transmitted frequency components.

Figure 12. FFT Output with Voiced Input

Figure 13. Voiced Frequency Selection

Figure 14. FFT Output with Fricative Input

Figure 15. Fricative Frequency Selection

4.2.1 Amplitude Quantization  The amplitudes of the transmitted frequency
components were linearly quantized as integers between zero and an empirically
determined maximum value. For the system described in this thesis, the maximum
value was found to be 800,000. This value is a function of the multiplicative
constants of the FFT and other subroutines.

Again, the fact that speech energy drops off at a rate of 6 dB per octave was
used to match the speech processing system to speech characteristics. This was
done by adaptively adjusting the quantization step size of each frequency by the
inverse of the compensation value used in the frequency selection section of
the program, thus decreasing the step size and the maximum quantization error
for the higher frequency values. An example of quantization step size rolloff is shown
in Figure 16. As shown, if the quantization step size were 400 for 0 to 600 Hz, the
step size would decrease to 100 at the end of the first octave (1200 Hz) and would
further decrease to 25 by the second octave (2400 Hz).
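
The step size rolloff can be sketched as below. The compensation function
repeats the piecewise gain from the frequency selection section, with its
constants read as 0.0355, 0.018 and 0.009. The quantizer itself (2^bits levels,
a base step of maximum / levels, clamping, and mid-step reconstruction) is an
assumption made only for illustration; the thesis does not spell out these
details.

```python
import math

def compensation(n):
    """Piecewise gain for FFT bin n, as in the frequency selection section."""
    if n <= 37:
        return 1.0
    if n <= 76:
        return math.exp(0.0355 * (n - 38))
    if n <= 154:
        return 4.0 * math.exp(0.018 * (n - 77))
    return 16.0 * math.exp(0.009 * (n - 154))

def step_size(base_step, n):
    """Quantization step for bin n: the base step divided by the gain."""
    return base_step / compensation(n)

def quantize_amplitude(amplitude, n, bits, max_amplitude=800000.0):
    """Hypothetical uniform quantizer; returns (level index, decoded value)."""
    levels = 2 ** bits
    base_step = max_amplitude / levels          # assumed base step
    step = step_size(base_step, n)
    index = min(int(amplitude / step), levels - 1)   # clamp to the top level
    return index, (index + 0.5) * step
```

With a base step of 400, `step_size(400, 77)` is 100 and `step_size(400, 154)`
is very close to 25, matching the rolloff example in the text.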

4.2.2 Phase Quantization  The phase values of the transmitted frequency
components were linearly quantized between 0 and 2π. For the final system, all phase
components are quantized and sent. However, several different phase transmission
combinations were tried, including:

1. Sending the phase of the first component and setting all other phase values to
zero.

2. Sending the phase of the four lowest frequency components and setting all
other phase values to zero.

3. Sending the phase values of all components other than the four lowest
frequencies and setting the remaining phase values to zero.

All of the above schemes caused the reproduced speech to sound as if the
speaker was hoarse from a bad cold.

Figure 16. Example of Adaptive Amplitude Quantization Scheme
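
Linear phase quantization between 0 and 2π can be sketched as follows. Rounding
to the nearest level and wrapping the top level back to zero are assumptions;
the thesis states only that the quantization is linear.

```python
import math

TWO_PI = 2.0 * math.pi

def quantize_phase(phase, bits):
    """Map a phase in radians to one of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    wrapped = phase % TWO_PI                 # bring into [0, 2*pi)
    return int(round(wrapped / TWO_PI * levels)) % levels

def decode_phase(index, bits):
    """Phase in radians represented by a level index."""
    return index * TWO_PI / (2 ** bits)
```

Under this sketch a one-bit phase takes only the values 0 and π, and a three-bit
phase is never more than π/8 away from the true value.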

4.2.3 Frequency Coding  In order to transmit the frequency location of a
selected component, both the harmonic and the jitter components had to be
transmitted. As described in the frequency selection section, the speech analysis system
searches for the highest energy frequency in the neighborhood of glottal frequency
harmonics. The frequency location of the selected component is given by the number
of the selected harmonic and the amount of difference from the exact harmonic.

bin  =  (h*g)  +  j  (7) 

where: 

•  h  =  The  number  of  harmonics  of  the  glottal  bin  between  the  transmitted 
harmonic  and  the  previous  transmitted  harmonic. 

•  g  =  The  glottal  frequency  bin  for  the  timeslice. 

•  j  =  The  jitter  or  difference  between  the  exact  harmonic  and  the  frequency 
selected. 

For example, if the glottal frequency bin is found to be the 8th bin, a possible
frequency would be the 10th harmonic with a jitter of -1. In this example the
frequency bin would be interpreted to be (8 * 10) - 1 = 79, the 79th frequency bin.
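
The frequency coding relation can be sketched as an encoder and decoder pair.
The sketch assumes that h is sent as the number of harmonic steps since the
previously transmitted component, so the receiver accumulates the steps, and
that each selected bin is assigned to its nearest harmonic; the function names
are hypothetical.

```python
def decode_bins(glottal_bin, coded):
    """Recover FFT bin numbers from (harmonic step, jitter) pairs."""
    bins = []
    harmonic = 0
    for step, jitter in coded:
        harmonic += step                 # running harmonic number
        bins.append(harmonic * glottal_bin + jitter)
    return bins

def encode_bins(glottal_bin, bins):
    """Inverse of decode_bins for bins given in increasing order."""
    coded = []
    previous = 0
    for b in bins:
        harmonic = int(b / glottal_bin + 0.5)      # nearest harmonic
        coded.append((harmonic - previous, b - harmonic * glottal_bin))
        previous = harmonic
    return coded
```

For the example in the text, `decode_bins(8, [(10, -1)])` yields bin 79, and
`encode_bins(8, [79])` yields the pair (10, -1).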

In order to transmit the frequency in this manner, the output signal had to be
limited to a maximum number of bins. In order to transmit the harmonic number
with 5 bits, the number of bins had to be limited to:

(c - 1) * 7 + 32 * 7    (8)

Equation 8 represents the worst case frequency transmission situation. In this
situation, all but one of the frequencies are as close together as possible and the
last frequency is as far away from the others as possible. Each 7 in the equation
represents the lowest bin in which the system searches for the glottal frequency. The
first 7 represents the glottal frequency being found in the 7th FFT bin. The 32 * 7
term represents the maximum distance to a frequency bin which can be selected with
5 bits when the glottal bin is 7. c represents the number of components sent. The
(c - 1) * 7 term calculates the smallest number of bins covered by the first c - 1
frequencies. If the system is not limited to this number of bins, it is possible for
one of the selected frequencies to be more than 32 multiples of the glottal frequency
away from the previously selected frequency, thus causing an error at the receiver.
The system typically used 8 components and was therefore limited to 273 frequency
bins. Each bin is 15.625 Hz wide; therefore, limiting to 273 bins corresponds to
4265 Hz bandlimiting. The final system design limited the output speech energy to
frequencies below 4000 Hz. The maximum number of bins was therefore limited to
256, well below the 273 maximum derived above.
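
The arithmetic behind the bin limit can be reproduced directly. The sketch
below assumes a 16 kHz sample rate and a 1024-point FFT for the bin width; the
function name and parameter names are illustrative.

```python
def max_bins(components, lowest_glottal_bin=7, harmonic_bits=5):
    """Worst-case bin limit: (c - 1) * 7 + 32 * 7 for the default values."""
    return ((components - 1) * lowest_glottal_bin
            + (2 ** harmonic_bits) * lowest_glottal_bin)

BIN_WIDTH_HZ = 16000.0 / 1024              # 15.625 Hz per bin

limit = max_bins(8)                        # eight components per timeslice
limit_hz = limit * BIN_WIDTH_HZ            # bandwidth covered by the limit
bins_for_4khz = int(4000.0 / BIN_WIDTH_HZ) # bins needed to reach 4000 Hz
```

For eight components the limit is 7 * 7 + 32 * 7 = 273 bins, or 4265.625 Hz,
while a 4000 Hz band needs only 256 bins.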

4.3  Energy  Conservation 

Speech signals which were regenerated from a small number of spectral
components had a fluctuation in volume. The reason for the fluctuation was determined
to be that the amount of energy in the selected frequency components varied greatly
from frame to frame. The volume fluctuation was particularly evident at
voiced-unvoiced boundaries. The spectra of the voiced frames had small areas where the
energy of the frame was concentrated. However, the unvoiced frame spectra had
a relatively uniform distribution. When small numbers of frequency components were
selected for transmission, a significantly smaller amount of energy was selected from
the unvoiced frames. The result was weak sounding fricatives and, as mentioned
before, a fluctuation in the volume from frame to frame.

To compensate for the volume fluctuations, the amount of energy in the chosen
spectral components was amplified so that the amount of energy in the reconstructed
speech was equal to the energy in the original timeframe. The amount of
amplification varied from frame to frame and was calculated to be:

$$ d = \frac{\sum_{n=0}^{255} c_n^2}{\sum_{k=0}^{N-1} c_k^2} \tag{9} $$

Where:

$N$ is the number of spectral components selected for the timeslice.

$\sum_{n=0}^{255} c_n^2$ is the energy in the original timeslice.

$\sum_{k=0}^{N-1} c_k^2$ is the energy in the selected spectral components.

The result of this operation was higher volume fricatives.

Figure 17. Beginning of Word Reproduced with all Frequency Components
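
A minimal sketch of the energy conservation step follows. It treats d as the
ratio of total spectral energy to selected energy, and it assumes that the
selected amplitudes are scaled by the square root of d so that the energies
match; the square root and the handling of an empty selection are assumptions,
and the names are hypothetical.

```python
import math

def energy_gain(spectrum, selected_bins):
    """Ratio of the energy in the whole spectrum to the selected energy."""
    total = sum(a * a for a in spectrum)
    kept = sum(spectrum[k] ** 2 for k in selected_bins)
    return total / kept if kept > 0 else 0.0   # assumed for empty selections

def rescale(spectrum, selected_bins):
    """Selected amplitudes scaled so their energy equals the total energy."""
    scale = math.sqrt(energy_gain(spectrum, selected_bins))  # assumption
    return {k: spectrum[k] * scale for k in selected_bins}

# Toy four-bin amplitude spectrum with two components selected.
example_spectrum = [3.0, 4.0, 0.0, 12.0]
example_selected = [1, 3]
example_gain = energy_gain(example_spectrum, example_selected)
example_scaled = rescale(example_spectrum, example_selected)
```

In the toy example the total energy is 169 and the selected energy is 160, so
the gain is slightly above one and the rescaled components again carry an
energy of 169.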


4.4 Voiced-Unvoiced Thresholding

Figure 18. Beginning of Word Reproduced with Three Frequency Components

During preliminary subjective listening tests, listeners complained about a
ringing noise during unvoiced portions of the reproduced speech. Inspection of the
waveform in unvoiced regions revealed that fricative signals were riding on top of
lower frequency components which were not evident in the original time waveform
(see Figures 17 and 18). The lower frequency on which the fricatives were riding
was approximately the same as the glottal frequency. The speech processing system
always transmitted a frequency which was believed to be the glottal frequency, and
this was the corrupting lower frequency in the fricatives. To reduce the ringing in
the fricatives, lower frequencies had to be eliminated from the unvoiced portion of
the signal. To eliminate the ringing, a decision had to be made as to whether each
timeslice was voiced, unvoiced, or blank space. The decision was based on the
amplitude of the glottal frequency estimate, which was made in all three
classes of time waveform. The decision thresholds were empirically determined to
be:

1. If the amplitude of the glottal frequency estimate is above 100,000, the timeslice
is determined to be voiced speech.

2. If the amplitude of the glottal frequency estimate is between 35,000 and 100,000,
the region is determined to be unvoiced speech.

3. If the amplitude of the glottal frequency estimate is below 35,000, the region is
blank space.

Thresholding also eliminated noise which had been generated in the blank space
due to the energy conserving subroutine putting all of the frame’s energy into a few
frequencies. Only voiced and unvoiced speech frames were sent into the energy
conservation subroutine. Thresholding decreased the noise; however, it did not
completely eliminate the ringing noise. The decision failed if a timeslice was half
voiced and half unvoiced.
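
The threshold decision can be sketched as a small classifier. The thesis does
not say how amplitudes exactly equal to a threshold are treated, so the
boundary handling below is an assumption, as are the function names.

```python
VOICED_THRESHOLD = 100000      # glottal estimate amplitude, from the text
UNVOICED_THRESHOLD = 35000     # glottal estimate amplitude, from the text

def classify_timeslice(glottal_amplitude):
    """Label a timeslice from the amplitude of its glottal estimate."""
    if glottal_amplitude > VOICED_THRESHOLD:
        return "voiced"
    if glottal_amplitude >= UNVOICED_THRESHOLD:   # boundary choice assumed
        return "unvoiced"
    return "blank"

def uses_energy_conservation(label):
    """Only voiced and unvoiced frames go to the energy conservation step."""
    return label in ("voiced", "unvoiced")
```

A timeslice that is half voiced and half unvoiced still receives a single
label under this rule, which is the failure case noted above.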

4.5 Data Rate

The data rate of the system was found to be:

$$ \frac{S_r \cdot L \cdot (A + P + F)}{R \cdot N_s} \tag{10} $$

where:

$S_r$ is the sample rate of the A to D converter.

$L$ is the number of selected frequency components used for reconstructing.

$A$ is the number of bits used to quantize amplitude information.

$P$ is the number of bits used to quantize phase information.

$F$ is the number of bits used to quantize frequency information.

$R$ is the percent overlap used for the FFT (50%).

$N_s$ is the number of sample points taken to be one timeslice.
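
The data rate expression can be evaluated for the final design values reported
in Section 6.5 (L = 8, A = 2, P = 3, 40 ms timeslices at 16,000 samples per
second, giving 640 samples). The value F = 7 used below is an assumption, chosen
as 5 harmonic bits plus 2 jitter bits; the thesis text does not state F
explicitly. The function name is illustrative.

```python
def data_rate_bps(sample_rate, components, amp_bits, phase_bits,
                  freq_bits, overlap, samples_per_slice):
    """Bits per second: Sr * L * (A + P + F) / (R * Ns)."""
    bits_per_slice = components * (amp_bits + phase_bits + freq_bits)
    return sample_rate * bits_per_slice / (overlap * samples_per_slice)

# freq_bits=7 is an assumed value (5 harmonic bits + 2 jitter bits).
rate = data_rate_bps(sample_rate=16000, components=8, amp_bits=2,
                     phase_bits=3, freq_bits=7, overlap=0.5,
                     samples_per_slice=640)
```

Under that assumption the expression evaluates to 16000 * 8 * 12 / 320 = 4800
bits per second, which agrees with the rate quoted for the final design.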


Figure  19.  Original  Fricative  Waveform 


4.6  Synthesis  System 

After the signal was analyzed and coded, the estimate of the speech was
generated with the sinusoidal speech model described by Quatieri and McAulay (9). The
estimate was generated using the reconstruction model

$$ s(n) = \sum_{t=0}^{N-1} \sum_{f=0}^{maxf} A_f \cos\left(\frac{2\pi f t}{N} + \phi_f\right) \tag{11} $$

where:

$s(n)$ is the reconstructed speech signal.

$maxf$ is the maximum allowed frequency bin for the system.

$A_f$ is the quantized amplitude of the selected frequency component.

$\phi_f$ is the quantized phase of the selected frequency component.

$t$ is the time index for the timeslice.

$f$ is the frequency index.

$N$ is the number of samples per timeslice.

Figure 20. Reconstructed Fricative Waveform

Figure 21. Original Vowel Waveform

Figure 22. Reconstructed Vowel Waveform

Examples of an original and reproduced voiced waveform are shown in Figures 21
and 22. The reconstructed vowel was generated using 16 frequency components.
Note that the reconstruction is not exact; however, the waveforms are very similar.
An example of an original fricative waveform and its reconstruction is shown in
Figures 19 and 20. Here the waveforms are not as similar as the reconstructed vowel
is to its original waveform, yet the fricatives in the utterance are clear.
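
The sum-of-cosines reconstruction of one timeslice can be sketched as follows.
The mapping from bin number to frequency through a 1024-point FFT size is an
assumption, the output window and overlap-add steps are omitted, and the
function name is hypothetical.

```python
import math

def synthesize(components, n_samples, fft_size=1024):
    """Sum of cosines for one timeslice.

    components: list of (bin, amplitude, phase) tuples for the selected
    frequency components. Each bin is taken to lie at bin / fft_size
    cycles per sample (assumed mapping).
    """
    out = []
    for t in range(n_samples):
        out.append(sum(a * math.cos(2.0 * math.pi * b * t / fft_size + p)
                       for b, a, p in components))
    return out

# A single component at one quarter of the sample rate.
quarter_rate = synthesize([(256, 1.0, 0.0)], 4)
```

In a full system each synthesized timeslice would then be windowed and added to
its neighbors with the 50 percent overlap described earlier.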
V. System Testing


5.1  Introduction 

There are several types of speech reproduction success measures. Some of these
attempt to measure the success of reconstruction by calculating the absolute error
in the signal reconstruction. These measures are useful because a reconstruction
with a very low error would have a very good chance of being accepted by the
human listener. However, there could be cases of a large signal error which does not
offend the human ear at all. There are two measures which describe the success of
a coding system’s ability to interface with human listeners. The first is “quality” of
reproduction. This is a measure of how natural the reproduction sounds to listeners,
including how much the reproduction sounds like the speaker. The second measure
is “intelligibility” of the reproduction. Intelligibility is the ability of the system to
reproduce speech such that the listener can distinguish what word was sent from
other similar words. For example, can the listener tell the difference between “cat”
and “sat” being sent?

There is no single test which measures both speech quality and intelligibility.
The test which measures speech intelligibility is the Modified Rhyme Test (MRT).
Due to the size of the files required to run the MRT on the speech processing system,
it was not feasible to run the MRT on the system. Instead, a set of subjective
listening tests, designed to measure the quality of the speech reconstruction versus
the design parameters, was run on the system. In these listening tests, listeners were
asked to rate the quality of the reproduced speech with different design parameter
settings. How the parameters were varied will be discussed later in this chapter. The
parameter settings were selected based on the performance of the parameters in the
listening tests.
5.2 Listening Tests

Listening tests were performed on the speech processing system to measure
how each of the design parameters affected the human perceived quality of speech.
The tests consisted of having seven people listen to a test tape which consisted of
several different reproductions of the same utterance. Each of the reproductions was
the result of the speech processing system operating on the same digitized speech
input file.


5.2.1 Test Tape  As described in the data rate section of this thesis, the
data rate of the system was found to be:

$$ \frac{S_r \cdot L \cdot (A + P + F)}{R \cdot N_s} \tag{12} $$

where:

$S_r$ is the sample rate of the A to D converter.

$L$ is the number of selected frequency components used for reconstructing.

$A$ is the number of bits used to quantize amplitude information.

$P$ is the number of bits used to quantize phase information.

$F$ is the number of bits used to quantize frequency information.

$R$ is the percent overlap used for the FFT (50%).

$N_s$ is the number of sample points taken to be one timeslice.

The  parameters  to  be  adjusted  in  the  listening  tests  were  L,  A,  P,  and  F. 

The test tape was partitioned into three sections. The first section tested
the variation of the parameter L, the number of frequency components selected to
reconstruct the timeslice. In this section of the tape the amplitude and phase (A
and P) were transmitted unquantized. Therefore all of the error in the reconstructed
speech signal was due to the lack of certain frequency components. The section
began with a reconstruction of the utterance “testing one two three” using only two
frequency components. The section continued with reproductions using 3, 4, 5, 6, 7,
8, 9, 10, 15 and 30 frequency components.

The second portion of the tape was the section which tested the response to
the variation of the parameter P, the number of bits used to quantize the phase
information. In this section the amplitude was unquantized and the number of
frequency components was set to 30. The phase variation section of the tape began
with a reconstruction of the speech signal using one phase bit, or zero crossing phase.
The section continued with reconstructions using 2, 3, 4, 5, 6, 7, and 8 bit phase.

The final section of the test tape was the variation of the parameter A, the
number of bits used to quantize the amplitude information. In this section the phase
information was unquantized and the number of frequency components selected was
set to 30. Like the phase variation portion, the amplitude portion of the tape began
with a speech reconstruction using only one bit, or on/off amplitude information.
The section continued with reconstructions using 2, 3, 4, 5, 6, 7, and 8 bit amplitude
quantization.

5.2.2 Test Procedure  The listening test was run on seven volunteers. None
of the volunteers selected had been a preliminary listening test subject, nor had they
repeatedly listened to the speech reproductions for any other reason. After volunteering
they were given the following instructions.


Thank you for volunteering for the listening test portion of the Frequency
Domain Speech Coding System. The test will take approximately 5 minutes
each day for three days. The following is a list of the information
which will be helpful in performing the test.

1. There are 27 utterances which have been reproduced by a data
reduction coding system and recorded on a cassette tape. The tape
will be played for you. Each utterance is separated from the next
utterance by about one second.

2. After each utterance reproduction, rate the quality of the
reproduction on the rating sheet provided. Please note the scale on the
rating sheet; that is, 0 means you could not understand the speech,
7 means the reproduction sounds as good as telephone quality.

3. Please remember that you are asked to return to perform the tests
three times.

Again thank you for volunteering.

Vance  McMillan 


After  reading  the  instructions  they  were  given  a  rating  sheet  and  asked  if  they 
understood  the  instructions  and  rating  sheet.  No  other  instructions  or  information 
was  given.  An  example  grading  sheet  is  shown  on  page  45. 


Utt Num

Scale

0 = Could not understand
7 = Telephone quality
10 = Sounds like original utterance
VI. Test Results


6.1  Introduction 

In this chapter the overall results of the listening tests are presented. The
complete set of listening test results is given in Appendix A. The overall result
curves represent the average test results from all seven listeners and all three test
measurements.

6.2 Quality Versus Number of Frequencies used in Reconstruction


Figure  23.  Quality  vs  Frequency  Components  Curve 

Figure 23 shows the results of the portion of the listening tests aimed at
measuring the quality of reconstruction as a function of the number of frequencies used
in reconstruction. The overall shape of the curve is increasing as expected; however,
there are some erratic points on the curve which were not expected. These data
points may be an artifact of the order of the reproductions on the test tape. During
the second and third iterations of the listening tests, the order of the reproductions
was scrambled to prevent the listeners from memorizing the pattern of the response
sheet, and thus biasing the results. With the order changed, the listeners tended to
rate the first few utterances lower than the first time through the test. Some of the
points on the curve which are lower than might be expected correspond to the first
utterances on the scrambled version of the test tape.

The quality versus frequencies plot shows a sharp increase in quality until the
number of frequencies reaches about eight. After this point only about one point in
quality is gained by sending up to thirty frequencies. Based on these observations,
the design parameter, number of frequencies used in reconstruction (L), was set to
eight frequencies.

6.3 Quality Versus Number of Phase Bits


Figure  24.  Quality  vs  Number  of  Phase  Bits  Curve 
Figure 25. Quality vs Number of Amplitude Bits Curve

The quality versus phase bits curve shown in Figure 24 increases very quickly
to its maximum value at about three bits, then remains approximately constant
through eight bits. The curve shows that coarsely quantized phase information is
sufficient for high quality reproduction. The design parameter (P) was set to three bits.

6.4 Quality Versus Number of Amplitude Bits

The final curve, shown in Figure 25, shows the relationship between quality
and the number of bits representing amplitude information. Like the phase curve,
the amplitude curve rises very quickly. The quality has nearly reached its maximum
value by the two bit quantization point. This shows the amplitude quantization
scheme performs well as rated by listeners. The design parameter (A), number of
amplitude bits, was set to two bits based on the information shown in the quality vs
amplitude bits curve.
6.5 System Design Results

The final system design parameters were L = 8, A = 2, and P = 3. In addition
to these parameters the timeframe size was increased to 40 ms for the final design.
This parameter was not one of the parameters included in the listening tests and
therefore had to be selected by quality comparisons by people not involved in the
formal listening tests. Increasing the size above 40 ms did have a slurring effect on the
speech according to the listeners. However, little slurring was noted for timeframes
below 40 ms.

These parameters resulted in a bit rate of 4800 bps in the final design of the
speech processing system. The reproduced speech from the design sounded very
good in the voiced regions and good in the unvoiced regions. The system did have
musical noise at the boundaries of voiced-unvoiced regions. There are several
hypotheses as to why the musical noise is present at these boundaries.

1. The spectra of the speech are changing quickly at the voiced-unvoiced
boundaries and therefore require a higher sample rate to prevent a time aliasing
effect. This could also be explained in the time domain by realizing that these
boundary signals are too complicated to be approximated from a few frequency
components.

2.  The  human  brain  is  able  to  fill  in  the  spectra  in  the  slowly  changing  regions. 
However,  the  brain  is  unable  to  fill  in  the  spectra  for  fast  changing  speech. 
The  musical  noise  is  the  brain’s  unsuccessful  attempt  to  fill  the  spectra  and 
therefore  the  noise  is  not  even  present  in  the  waveform(5). 

3.  The  speech  processing  system  makes  good  frequency  selections  to  represent 
voiced  or  unvoiced  speech  but  makes  poor  decisions  when  the  timeframe  is  a 
combination  of  the  two. 

A system design validation test was run on the system with the final design
parameters installed. The test consisted of having three of the listeners used in the
listening tests rate three utterances using the same scale used in the listening tests.
The three utterances were: the original speech file, the system operating with the
final design parameters, and the 30 frequency reconstruction with no quantization
of the amplitude or phase information. The ratings were 8, 5.0, and 5.7 respectively.
The results show that the original speech file does not rate a ten as might be expected.
This may be the result of distortion in the recording and in the playback headphones.
The 30 frequency reconstruction was used as a reference to see how the results
correlated with the listening test results. This reproduction rated 6.2 in the
listening test compared to 5.7 in the validation test, well within the change
expected from eliminating four of the listeners. The 5.0 rating for 4800 bps speech
is encouraging. The result is only 2 rating points below toll quality speech at
4800 bps.
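
As a rough check on the 4800 bps figure, the per-frame bit budget can be worked out with a few lines of arithmetic. The sketch below is an illustration only. It assumes that a new frame is sent every 20 ms, which follows from the 40 ms timeframe if the 50 percent overlap noted in the Appendix B listing is used. The thesis text does not give the per-frame bit allocation, so the function names and the split into "side" bits are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: bits available for each frame update, given a
 * bit rate in bits per second and a frame hop in milliseconds. */
int bits_per_frame(int bit_rate_bps, int hop_ms)
{
    assert(bit_rate_bps > 0 && hop_ms > 0);
    return bit_rate_bps * hop_ms / 1000;
}

/* Hypothetical helper: bits left per component for frequency and other
 * side information after amplitude and phase bits are spent. */
int side_bits_per_component(int frame_bits, int components,
                            int amp_bits, int phase_bits)
{
    assert(components > 0);
    return frame_bits / components - (amp_bits + phase_bits);
}
```

Under the 20 ms hop assumption, 4800 bps corresponds to 96 bits per frame. With L = 8, A = 2, and P = 3 this leaves 7 bits per component for everything other than amplitude and phase. This is one possible accounting, not a statement of how the thesis allocates bits.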
VII. Conclusions and Recommendations


This thesis has been successful in refining the design of the sinusoidal model
based speech coding scheme. The adaptive amplitude quantization design was
validated by the listening test quality versus amplitude bits curve. The system
performed well at reproducing speech at 4800 bps, but the quality rating did not
reach toll quality because of the noise at the voiced-unvoiced speech boundaries.
Without the noise it may be possible to transmit toll quality speech at 4800 bps.

The next step in refining frequency domain speech coding should be to isolate
the cause of the musical noise. If the noise is caused by the brain trying to fill
in the spectra in the quickly changing speech, then it may be eliminated by
filling in the spectra at the receiver with a spectrum filling function, a function
which fills in the spectra based on the limited components available. Efforts in
this thesis included trying to fill the spectra with a sinc filling function, which
failed. Gaussian filling functions were also tried; however, a standard deviation
which does not introduce noise into the speech was not found.

If  the  noise  is  due  to  decisions  by  the  system  at  voiced-unvoiced  boundaries  then 
a  new  set  of  rules  may  need  to  be  developed.  These  rules  must  make  good  frequency 
selection  choices  at  the  boundaries.  Additionally,  a  method  of  determining  when  the 
boundaries  occur  would  need  to  be  developed. 
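
One conventional cue that such a boundary detector could start from is the zero-crossing rate, which tends to be high in unvoiced speech and low in voiced speech. The sketch below is an assumption made for illustration and is not a method proposed or tested in this thesis. The function names and the threshold parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fraction of adjacent sample pairs in x[0..n-1]
 * whose signs differ. */
double zero_crossing_rate(const double *x, size_t n)
{
    assert(x != NULL && n >= 2);
    size_t crossings = 0;
    for (size_t i = 1; i < n; ++i) {
        if ((x[i - 1] >= 0.0) != (x[i] >= 0.0))
            ++crossings;
    }
    return (double)crossings / (double)(n - 1);
}

/* Hypothetical helper: returns 1 when the two halves of a frame differ
 * in zero-crossing rate by more than threshold, which would suggest a
 * mixed voiced-unvoiced frame; returns 0 otherwise. */
int looks_like_boundary(const double *frame, size_t n, double threshold)
{
    assert(frame != NULL && n >= 4);
    size_t half = n / 2;
    double first = zero_crossing_rate(frame, half);
    double second = zero_crossing_rate(frame + half, n - half);
    double diff = first - second;
    if (diff < 0.0)
        diff = -diff;
    return diff > threshold;
}
```

A practical detector would likely need more than this single cue, and the threshold would have to be chosen from speech data.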

Finally, the most likely reason for the noise is that the system is trying to
reconstruct a timeslice which is generally made of high frequency components during
one half of the timeslice (unvoiced) and lower frequency components in the other
half (voiced). The reconstruction is attempted with a small number of frequency
components. The fact that the signal is changing quickly and is sparsely sampled
creates a “time aliasing” effect. One method of eliminating or reducing the effect
would be to find a way to sense that the timeslice is a voiced-unvoiced timeslice
and implement a new set of frequency selection rules to cancel the aliasing effect.
The new rules would need to sample the spectra more often in the quickly changing
speech timeframes. A sampling scheme which knows how fast the waveform is changing
will be extremely difficult to develop because the system must be non-causal. The
delay due to the non-causality must remain small enough to prevent the system users
from noticing it, or the “real time” appearance of the system will be lost.

Another approach to solving the voiced-unvoiced boundary problem could be to
have a method of adaptively adjusting the timeslice size based on the input speech.
This could be accomplished by having a wavelet system looking ahead of the Fourier
system. The wavelet system would recognize the different phonemes and would set
speech timeslice boundaries accordingly. This type of system is particularly
attractive because it would not only solve the voiced-unvoiced problem but would
possibly extend the average timeslice length, thus decreasing the data rate
considerably.

The  Fourier  based  speech  coding  system  has  performed  well  in  reconstructing 
low  bit  rate  speech  signals.  It  has  been  shown  that  it  is  capable  of  reproducing  low 
bit  rate  (4800  bps)  speech  at  near  toll  quality.  With  further  research,  the  Fourier 
system  will  probably  perform  as  well  as  Linear  Predictive  Coding  (LPC)  speech 
coding  systems  and  possibly  will  perform  better  than  these  systems. 
Appendix A. Listening Test Result Charts


The data collected in the listening tests described in Chapter 5 are presented
here. Each plot represents the average data collected from one volunteer. The
listener's initials are shown in parentheses to the right of the plot title. All
plots of quality versus number of frequencies are presented first. The plots of
quality versus phase bits follow. Finally, the quality versus amplitude bits curves
are presented.
Figure 26. Average Quality Score versus Number of Frequency Components (L)
Figure 27. Average Quality Score versus Number of Frequency Components (L)
Figure 28. Average Quality Score versus Number of Frequency Components (L)
Figure 29. Average Quality Score versus Number of Frequency Components (L)
Figure 30. Average Quality Score versus Number of Frequency Components (L)
Figure 31. Average Quality Score versus Number of Frequency Components (L)
Figure 32. Average Quality Score versus Number of Frequency Components (L)
Figure 33. Average Quality Score versus Number of Phase Bits (P)
Figure 35. Average Quality Score versus Number of Phase Bits (P)
Figure 36. Average Quality Score versus Number of Phase Bits (P)
Figure 37. Average Quality Score versus Number of Phase Bits (P)
Figure 38. Average Quality Score versus Number of Phase Bits (P)
Figure 40. Average Quality Score versus Number of Amplitude Bits (A)
Figure 41. Average Quality Score versus Number of Amplitude Bits (A)
Figure 42. Average Quality Score versus Number of Amplitude Bits (A)
Figure 43. Average Quality Score versus Number of Amplitude Bits (A)
Figure 44. Average Quality Score versus Number of Amplitude Bits (A)
Figure 45. Average Quality Score versus Number of Amplitude Bits (A)
Figure 46. Average Quality Score versus Number of Amplitude Bits (A)
Figure 47. Average Quality Score versus Number of Frequencies (L) All Listeners
Figure 49. Average Quality Score versus Number of Amplitude Bits (A) All Listeners
Appendix B. Computer Source Code


/*********************************************************/
/**                                                     **/
/**             SPEECH PROCESSING SYSTEM                **/
/**                                                     **/
/*********************************************************/

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define pi 3.1415926535
#define e 2.71828182846
#define hw 1
#define conv 0
#define slice 640
#define jitter 2

double max, maxfreq, compensate_vector[256], quant_vector[256];
double qexp_amp_vector[256], count_vector[512], vcutoff;
double bcutoff;
int freq_vector[256], vecsize;

main()
{
    double temp_even[slice], temp_odd[slice], txamp_vector[256];
    double output_vector[slice], txphase_vector[256];
    double time_vector[slice], exp_amp_vector[512], check;
    /* one extra element: fft() shifts the frame to 1-based indexing */
    double hamming_frame[1025], amp_vector[512], phase_vector[512];
    double timeframe_vector[slice], exp_phase_vector[512];
    int i, j, k, r, n, x, b1, b2, b3, hi, size, begin;
    int putfile, end, count, harm[256], sum;
    int amp_bits, phase_bits, window_size, components;
    char Temp_in[25], Temp_in_file[22], Temp_out_file[22];
    char Temp_out[25], temp_file[25];
    FILE *input_ptr, *output_ptr, *data_ptr, *temp_ptr;

    printf(" Enter input file name less (.dat) \n");
    scanf("%s", Temp_in_file);

    printf(" Enter output file name less (.out) \n");
    scanf("%s", Temp_out_file);

    printf(" Enter the number of amplitude bits\n");
    scanf("%d", &amp_bits);

    printf(" Enter the number of phase bits\n");
    scanf("%d", &phase_bits);

    /*printf(" Enter the voiced cutoff threshold \n");
    scanf("%f", &vcutoff);

    printf(" Enter the blank space cutoff threshold \n");
    scanf("%f", &bcutoff);*/

    printf("Enter the number of spectral components to be chosen \n");
    scanf("%d", &components);

    sprintf(Temp_in, "%s.dat", Temp_in_file);
    sprintf(Temp_out, "%s.out", Temp_out_file);

    input_ptr = fopen(Temp_in, "r");
    output_ptr = fopen(Temp_out, "wr");

    i = window_size % 2;
    i = 1;

    if (i != 1)
    {
        printf(" window size must be odd \n Enter window size\n");
        scanf("%d", &window_size);
    }

    /* open file, read to end of file */
    max = 0;
    maxfreq = 0;
    i = 1;
    begin = 1;
    sum = 0;

    setup_quant(amp_bits);
    setup_compvec(compensate_vector);

    /* read in header */
    for (r = 0; r < 128; ++r)
    {
        b3 = getw(input_ptr);
        putw(b3, output_ptr);
    }

    while (!feof(input_ptr))
    {
        /* take pointer past header data and read in data
           with 50 percent overlap */
        j = slice + (i-1)*(slice);
        fseek(input_ptr, j, 0);

        /* read in speech time frame vectors */
        for (k = 0; k < slice && !feof(input_ptr); ++k)
        {
            /* read in byte one and byte two then convert msd */
            b1 = getc(input_ptr);
            b2 = getc(input_ptr);
            hi = b2*256 + b1;

            /* convert negative numbers by two's complement */
            if (hi >= 32767)
                hi = hi - 65536;

            timeframe_vector[k] = (float)hi;
        } /** end for k **/

        hamming_window(timeframe_vector, hamming_frame);

        fft(1024, hamming_frame, amp_vector, phase_vector);

        freq_select(amp_vector, phase_vector, txamp_vector,
                    txphase_vector, freq_vector);

        comp_select(txamp_vector, txphase_vector, freq_vector,
                    harm, components, i);

        if (txamp_vector[0] > 0)
            energy_conserve(amp_vector, txamp_vector, components);

        expand(txamp_vector, txphase_vector, freq_vector, harm,
               exp_amp_vector, exp_phase_vector, components);

        quantphase(exp_phase_vector, phase_bits);

        quantamp(exp_amp_vector, qexp_amp_vector, amp_bits);

        /* convolve(exp_amp_vector, exp_phase_vector, harm,
                    freq_vector, components); */

        freq_to_time(qexp_amp_vector, exp_phase_vector, time_vector);
        reduce(time_vector, output_vector, i, temp_even, temp_odd);
        putfile = i % 2;
        if (putfile == 0)
            normalize(output_vector, output_ptr);
        begin = 0;

        ++i;
    } /** end while (!eof) **/
} /** end main **/


/************************************************************/
/*                                                          */
/*         SUBROUTINE SETUP COMPENSATE VECTOR               */
/*                                                          */
/*  This routine sets up a vector which is used to          */
/*  multiply the amplitude vector for decision making.      */
/*  The vector compensates for the 6 dB per octave          */
/*  rolloff in energy in speech signals.                    */
/*                                                          */
/************************************************************/

setup_compvec(compensate_vector)
double compensate_vector[];
{
    int n;

    for (n = 0; n < 38; ++n)
        compensate_vector[n] = 1;

    for (n = 38; n < 77; ++n)
        compensate_vector[n] = pow(e, .0355*(n-38));

    for (n = 77; n < 154; ++n)
        compensate_vector[n] = 4*pow(e, .018*(n-77));

    for (n = 154; n < 256; ++n)
        compensate_vector[n] = 16*pow(e, .009*(n-154));
}

/**********************************************************/
/*                                                        */
/*                 SUBROUTINE WINDOW                      */
/*                                                        */
/*  This subroutine sets up a vector called window        */
/*  for multiplication with each time frame vector.       */
/*  Windowing prevents the spectral leakage caused        */
/*  by multiplying the time waveform by a rect.           */
/*                                                        */
/**********************************************************/

hamming_window(timeframe_vector, hamming_frame)
double timeframe_vector[], hamming_frame[];
{
    int z, y, s;
    double ham_vector[slice];

    /* zero the whole 1024 point FFT buffer (zero padding) */
    for (s = 0; s < 1024; ++s)
        hamming_frame[s] = 0;

    /* despite the routine name, the window used is a half sine */
    for (y = 0; y < slice; ++y)
        ham_vector[y] = sin(pi*y/(slice-1));

    for (z = 0; z < slice; ++z)
    {
        hamming_frame[z] = timeframe_vector[z]*ham_vector[z];
    }
} /** end hamming window **/


/***********************************************************/
/*                                                         */
/*                   SUBROUTINE FFT                        */
/*                                                         */
/*  This subroutine takes the digitized speech             */
/*  timeframes and performs a fast Fourier transform       */
/*  on the data. The output is two vectors:                */
/*  amp_vector represents the magnitude of the             */
/*  transform and phase_vector represents the phase        */
/*  of the transform.                                      */
/*                                                         */
/***********************************************************/

fft(n, hamming_frame, amp_vector, phase_vector)
int n;
double hamming_frame[], amp_vector[], phase_vector[];
{
    int nv, nm, i, j, k, m, le, ld, p;
    /* n+1 elements: the data are shifted to 1-based indexing below,
       so hamming_frame must also hold n+1 elements */
    double xi[1025], ur, ui, rt, it, wr, wi, up;

    for (i = 0; i < n; i++)
    {
        xi[i] = 0;
    }

    nv = n/2; nm = n - 1; j = 1;
    m = log((n + 1)*1.0)/log(2.0);

    /* shift data up by one element (1-based indexing) */
    for (i = 1; i <= n; i++)
    {
        hamming_frame[n-i+1] = hamming_frame[n-i];
        xi[n-i+1] = xi[n-i];
    }

    /* bit reversal reordering */
    for (i = 1; i <= nm; i++)
    {
        if (i < j)
        {
            rt = hamming_frame[j];
            it = xi[j];
            hamming_frame[j] = hamming_frame[i];
            xi[j] = xi[i];
            hamming_frame[i] = rt;
            xi[i] = it;
        }
        k = nv;
        while (k < j)
        {
            j -= k;
            k /= 2;
        }
        j += k;
    }

    /* butterfly stages */
    for (k = 1; k <= m; k++)
    {
        ld = pow(2.0, k*1.0);
        le = ld/2;
        wr = cos(pi/le);
        wi = -sin(pi/le);
        ur = 1.0;
        ui = 0.0;

        for (j = 1; j <= le; j++)
        {
            for (i = j; i <= n; i += ld)
            {
                p = i + le;
                rt = hamming_frame[p]*ur - xi[p]*ui;
                it = hamming_frame[p]*ui + xi[p]*ur;
                hamming_frame[p] = hamming_frame[i] - rt;
                xi[p] = xi[i] - it;
                hamming_frame[i] = hamming_frame[i] + rt;
                xi[i] = xi[i] + it;
            }
            up = ur*wr - ui*wi;
            ui = ui*wr + ur*wi;
            ur = up;
        }
    }

    /* shift data back down to 0-based indexing */
    for (i = 1; i <= n; i++)
    {
        hamming_frame[i-1] = hamming_frame[i];
        xi[i-1] = xi[i];
    }

    for (j = 0; j < 512; ++j)
    {
        amp_vector[j] = sqrt(hamming_frame[j]*hamming_frame[j] +
                             xi[j]*xi[j]);
        count_vector[j] = count_vector[j] + amp_vector[j]/60;
        if (hamming_frame[j] == 0 && xi[j] == 0)
            phase_vector[j] = 0;
        else
            phase_vector[j] = atan2(xi[j], hamming_frame[j]);
    }
} /** end fft **/


/****************************************************************/
/*                                                              */
/*                    SUBROUTINE FILTER                         */
/*                                                              */
/*  This subroutine takes in the amplitude vector from          */
/*  the FFT and smoothes the curve by averaging all             */
/*  values within a window size. The routine passes             */
/*  back a smoothed amplitude vector.                           */
/*                                                              */
/****************************************************************/

filter(window_size, amp_vector)
int window_size;
double amp_vector[];
{
    int i, j;
    double temp, filtered_amp[256];

    for (i = 0; i < 256; ++i)
        filtered_amp[i] = amp_vector[i];

    for (i = 0; i < (256-window_size); ++i)
    {
        temp = 0;
        for (j = i; j < i+window_size; ++j)
        {
            temp = temp + amp_vector[j];
        } /** end for j **/
        filtered_amp[i+(window_size+1)/2] = temp/window_size;
    } /** end for i **/

    for (i = 0; i < 256; ++i)
        amp_vector[i] = filtered_amp[i];
} /** end filter **/
/*****************************************************************/
/*                                                               */
/*               SUBROUTINE FREQUENCY SELECT                     */
/*                                                               */
/*  This subroutine takes the spectral components of the         */
/*  timeframe and selects the critical components for            */
/*  transmission to the receiver.                                */
/*                                                               */
/*****************************************************************/

freq_select(amp_vector, phase_vector, txamp_vector,
            txphase_vector, freq_vector)
double amp_vector[], phase_vector[], txamp_vector[];
double txphase_vector[];
int freq_vector[];
{
    double glottal_amp, harm_amp, decide_vector[256];
    double compen_vector[256];
    int n;
    int glottal_bin, glot_harm, harm_bin, i, x, y;

    glottal_amp = 0.0;
    glottal_bin = 0;

    for (i = 0; i < 256; ++i)
    {
        decide_vector[i] = amp_vector[i]*compensate_vector[i];
        txamp_vector[i] = 0;
    }

    /* estimate the glottal (fundamental) bin from bins 5 to 15 */
    for (i = 5; i < 16; ++i)
    {
        if (fabs(decide_vector[i]) >= fabs(glottal_amp))
        {
            glottal_amp = decide_vector[i];
            glottal_bin = i;
        } /* end if */
    } /* end for */

    txamp_vector[0] = amp_vector[glottal_bin];
    freq_vector[0] = glottal_bin;
    txphase_vector[0] = phase_vector[glottal_bin];
    y = 2;

    if (glottal_bin == 0)
        glottal_bin = 7;

    /* pick the strongest bin within +/- jitter of each harmonic */
    while (y*glottal_bin < 256)
    {
        harm_amp = 0.0;
        harm_bin = 0;

        for (x = ((y*glottal_bin)-jitter); x <= ((y*glottal_bin)+jitter); ++x)
        {
            if (fabs(decide_vector[x]) >= fabs(harm_amp))
            {
                harm_amp = decide_vector[x];
                harm_bin = x;
            } /* end if */
        } /* end for */

        txamp_vector[y-1] = amp_vector[harm_bin];
        txphase_vector[y-1] = phase_vector[harm_bin];
        freq_vector[y-1] = harm_bin - (y*glottal_bin);
        ++y;
    } /* end while */
    vecsize = y;
} /* end freq_select */


/*****************************************************************/
/*                                                               */
/*               SUBROUTINE COMPONENT SELECT                     */
/*                                                               */
/*  This subroutine takes the vector which is comprised          */
/*  of the harmonics of the estimated glottal frequency          */
/*  (txamp_vector), multiplies it by the compensation            */
/*  vector, then selects the n greatest quantities for           */
/*  speech reconstruction. n is chosen by the user.              */
/*                                                               */
/*****************************************************************/

comp_select(txamp_vector, txphase_vector, freq_vector, harm,
            components, y)
double txamp_vector[], txphase_vector[];
int components, harm[], freq_vector[], y;
{
    double temp1[256], temp2[256], temp3[256], temp;
    double decide_vector[256], compensa_vector[256];
    int x, i, a, c, d, n;

    c = freq_vector[0];
    for (i = 1; i < vecsize; ++i)
    {
        d = c*(i+1) + freq_vector[i];
        temp1[i] = txamp_vector[i];
        temp2[i] = txphase_vector[i];
        temp3[i] = freq_vector[i];
        txamp_vector[i] = 0;
        txphase_vector[i] = 0;
        freq_vector[i] = 0;

        decide_vector[i] = temp1[i]*compensate_vector[d];
    } /* end for i */

    for (x = 1; x < components; ++x)
    {
        temp = 0;

        for (i = 1; i < vecsize; ++i)
        {
            if (fabs(decide_vector[i]) >= fabs(temp))
            {
                harm[x] = i;
                temp = decide_vector[i];
            } /* end if */
        } /* end for i */

        a = harm[x];

        txamp_vector[x] = temp1[a];
        txphase_vector[x] = temp2[a];
        freq_vector[x] = temp3[a];

        /* remove the chosen harmonic and de-emphasize its neighbors */
        decide_vector[a] = 0;
        decide_vector[a+1] = .7*decide_vector[a+1];
        decide_vector[a-1] = .7*decide_vector[a-1];
        decide_vector[a-2] = .85*decide_vector[a-2];
        decide_vector[a+2] = .85*decide_vector[a+2];
        printf("harmonic chosen = %d", harm[x]);
    } /* end for x */
} /* end component select */


/****************************************************************/
/*                                                              */
/*              SUBROUTINE PHASE QUANTIZATION                   */
/*                                                              */
/*  This subroutine takes the continuous phase vectors          */
/*  obtained from the FFT and quantizes them. This              */
/*  simulates the process which would be taken to encode        */
/*  the phases into binary code.                                */
/*                                                              */
/****************************************************************/

quantphase(exp_phase_vector, phase_bits)
double exp_phase_vector[];
int phase_bits;
{
    int i, num;
    double step_size, rem;

    for (i = 0; i < 256; ++i)
    {
        if (exp_phase_vector[i] < 0)
            exp_phase_vector[i] = exp_phase_vector[i]+2*pi;
    }

    step_size = 2*pi/(pow(2,phase_bits));

    for (i = 0; i < 256; ++i)
    {
        if (exp_phase_vector[i] != 0)
        {
            num = exp_phase_vector[i]/step_size;
            rem = exp_phase_vector[i]-num*step_size;
            if (rem > step_size/2)
            {
                exp_phase_vector[i] = (num+1)*step_size;
            }
            else
            {
                exp_phase_vector[i] = num*step_size;
            }
        }
    }
}

/****************************************************************/
/*                                                              */
/*            SUBROUTINE AMPLITUDE QUANTIZATION                 */
/*                                                              */
/*  This subroutine takes the continuous amplitude vectors      */
/*  obtained from the FFT and quantizes them. This              */
/*  simulates the process which would be taken to encode        */
/*  the amplitudes into binary code.                            */
/*                                                              */
/****************************************************************/

quantamp(exp_amp_vector, qexp_amp_vector, amp_bits)
double exp_amp_vector[], qexp_amp_vector[];
int amp_bits;
{
    int x, steps, num;
    double comp_quant, rem, step_size;

    steps = pow(2,amp_bits);
    step_size = 800000/(steps-1);

    for (x = 0; x < 256; ++x)
        qexp_amp_vector[x] = 0;

    for (x = 0; x < 256; ++x)
    {
        if (exp_amp_vector[x] > 1)
        {
            comp_quant = step_size/compensate_vector[x];
            num = exp_amp_vector[x]/comp_quant;
            rem = exp_amp_vector[x] - num*comp_quant;
            if (rem > comp_quant/2)
                qexp_amp_vector[x] = (num + 1)*comp_quant;
            else
                qexp_amp_vector[x] = num*comp_quant;
        } /* end if */
    } /* end for x */
}


/****************************************************************/
/*                                                              */
/*             SUBROUTINE ENERGY CONSERVATION                   */
/*                                                              */
/*  This subroutine compensates for the fact that               */
/*  different amounts of energy are eliminated from frame       */
/*  to frame by the frequency select subroutine.                */
/*  Compensation is accomplished by calculating the             */
/*  energy in the original frame and the tx frame, then         */
/*  a compensation factor is calculated for multiplication      */
/*  of the spectral components of the tx amplitude vector.      */
/*                                                              */
/****************************************************************/

energy_conserve(amp_vector, txamp_vector, components)
double amp_vector[], txamp_vector[];
int components;
{
    int i;
    double temp1, temp2, comp_factor;

    temp1 = 0;
    temp2 = 0;
    comp_factor = 0;

    for (i = 0; i < 256; ++i)
    {
        temp1 = temp1 + pow(amp_vector[i],2);
    } /* end for i */

    for (i = 0; i < components; ++i)
    {
        temp2 = temp2 + pow(txamp_vector[i],2);
    }

    if (temp1 > 1700000000.0)
        comp_factor = sqrt(temp1/temp2);
    else
        comp_factor = 1;

    for (i = 0; i < 256; ++i)
        txamp_vector[i] = txamp_vector[i]*comp_factor;
} /* end energy conserve */


/**************************************************************/
/*                                                            */
/*                  SUBROUTINE EXPAND                         */
/*                                                            */
/*  This subroutine expands the transmitted amplitude         */
/*  frame from its original size to the size which            */
/*  matches the original amplitude vector.                    */
/*                                                            */
/**************************************************************/

expand(txamp_vector, txphase_vector, freq_vector, harm,
       exp_amp_vector, exp_phase_vector, components)
double txamp_vector[], txphase_vector[], exp_amp_vector[];
double exp_phase_vector[];
int freq_vector[], harm[], components;
{
    int x, y, z;
    double sinc_vec[11], temp_vec[11], hold;
    int c;
    int d;
    int i;
    int h;

    sinc_vec[5] = 1;
    for (c = 6; c < 11; ++c)
        sinc_vec[c] = sinc_vec[10-c] = sin(pi*(c-5)/2)/(pi*(c-5)/2);

    x = freq_vector[0];

    for (y = 0; y < 256; ++y)
    {
        exp_phase_vector[y] = 0;
        exp_amp_vector[y] = 0;
    } /** end for y **/

    if (conv == 1)
    {
        hold = txamp_vector[0];
        for (c = 0; c < 11; ++c)
        {
            exp_amp_vector[c+(x-5)] = hold*sinc_vec[c];
            exp_phase_vector[c+(x-5)] = txphase_vector[0];
        }

        for (y = 1; y < components; ++y)
        {
            z = x*(harm[y]+1) + freq_vector[y];
            hold = txamp_vector[y];
            for (i = 0; i < 11; ++i)
            {
                exp_amp_vector[i+(z-5)] = hold*sinc_vec[i];
                exp_phase_vector[i+(z-5)] = txphase_vector[y];
            }
        } /** end for y **/
    }
    else
    {
        exp_amp_vector[x] = txamp_vector[0];
        exp_phase_vector[x] = txphase_vector[0];

        for (y = 1; y < components; ++y)
        {
            z = x*(harm[y]+1) + freq_vector[y];
            exp_amp_vector[z] = txamp_vector[y];
            exp_phase_vector[z] = txphase_vector[y];
        } /** end for y **/
    } /** end else **/
} /** end expand **/

/********************************************************/
/*                                                      */
/*               SUBROUTINE CONVOLVE                    */
/*                                                      */
/*  This subroutine takes the reduced speech            */
/*  spectra and convolves it with a sinc function.      */
/*                                                      */
/********************************************************/

convolve(exp_amp_vector, exp_phase_vector, harm, freq_vector, components)
double exp_amp_vector[], exp_phase_vector[];
int harm[], freq_vector[], components;
{
    double sinc_vec[11], temp_vec[11], hold;
    int c;
    int d;
    int i;
    int h;

    sinc_vec[5] = 1;
    for (c = 6; c < 11; ++c)
        sinc_vec[c] = sinc_vec[10-c] = sin(pi*(c-5)/2)/(pi*(c-5)/2);

    i = freq_vector[0];

    for (c = 0; c < 11; ++c)
        temp_vec[c] = exp_amp_vector[i]*sinc_vec[c];

    exp_amp_vector[i] = 0;

    for (c = i-5; c < i+6; ++c)
    {
        exp_amp_vector[c] = temp_vec[c-(i-5)];
        exp_phase_vector[c] = exp_phase_vector[i];
        printf("tv = %f", temp_vec[c-(i-5)]);
    }

    for (c = 1; c < components; ++c)
    {
        h = i*(harm[c]+1) + freq_vector[c];
        hold = exp_amp_vector[h];
        exp_amp_vector[h] = 0;
        for (d = 0; d < 11; ++d)
            temp_vec[d] = hold*sinc_vec[d];
        for (d = h-5; d < h+6; ++d)
        {
            exp_amp_vector[d] = temp_vec[d-(h-5)];
            printf("TV2 = %f", temp_vec[d-(h-5)]);
            exp_phase_vector[d] = exp_phase_vector[h];
        }
    }
} /** end convolve **/


/***************************************************************/
/*                                                             */
/*          FREQUENCY TO TIME DOMAIN CONVERSION                */
/*                                                             */
/*  This subroutine takes the selected spectral                */
/*  component's amplitude, phase and frequency and             */
/*  reconstructs an approximation of the original speech       */
/*  using the sinusoidal model of speech signals.              */
/*                                                             */
/***************************************************************/

freq_to_time(exp_amp_vector, exp_phase_vector, time_vector)
double exp_amp_vector[], exp_phase_vector[], time_vector[];
{
    int a, b, c, d;

    for (a = 0; a < slice; ++a)
    {
        time_vector[a] = 0;

        /* sum one cosine per retained bin of the 1024 point spectrum */
        for (b = 0; b < 256; ++b)
        {
            if (fabs(exp_amp_vector[b]) > 100)
                time_vector[a] = time_vector[a] +
                    (exp_amp_vector[b]*cos((2*pi*b*a/1024)
                    + exp_phase_vector[b]));
        }
    }
}


/***************************************************************/
/*                                                             */
/*  SUBROUTINE REDUCE                                          */
/*                                                             */
/*  This routine reduces the 2n-1 overlapped frames           */
/*  to the original number of frames (n). The frames           */
/*  are multiplied by the sin window function before           */
/*  reduction to smooth the effects of discontinuities         */
/*  from one frame to the next.                                */
/*                                                             */
/***************************************************************/


reduce(time_vector, output_vector, i, temp_even, temp_odd)
double time_vector[], output_vector[], temp_even[], temp_odd[];
int i;
{
    int x, k;

    if (i == 1)
    {
        /* first frame: window it and hold it as the odd frame */
        for (x = 0; x < slice; ++x)
            temp_odd[x] = time_vector[x]*sin(pi*x/(slice-1));
    }
    else
    {
        k = i % 2;

        if (k == 0)
        {
            /* even frame: window it, then emit one output frame */
            for (x = 0; x < slice; ++x)
                temp_even[x] = time_vector[x]*sin(pi*x/(slice-1));

            for (x = 0; x < slice/2; ++x)
            {
                output_vector[x] = temp_odd[x];
                output_vector[x+(slice/2)] = (temp_odd[x+(slice/2)]
                    + temp_even[x]);
            }

            for (x = 0; x < slice; ++x)
                temp_odd[x] = 0;
        }
        else
        {
            /* odd frame: window it and add the tail of the even frame */
            for (x = 0; x < slice; ++x)
                temp_odd[x] = time_vector[x]*sin(pi*x/(slice-1));

            for (x = 0; x < slice/2; ++x)
                temp_odd[x] = temp_odd[x] + temp_even[x+(slice/2)];

            for (x = 0; x < slice; ++x)
                temp_even[x] = 0;
        }
    }
}

/***************************************************************/
/*                                                             */
/*  SUBROUTINE NORMALIZE                                       */
/*                                                             */
/*  This subroutine takes the approximation of the original    */
/*  speech waveform and normalizes the data to make it         */
/*  compatible with the DSC digital to analog converter.       */
/*  The normalized data is then written to an output file      */
/*  specified by the user.                                     */
/*                                                             */
/***************************************************************/

normalize(output_vector, output_ptr)
double output_vector[];
FILE *output_ptr;
{
    int byte1, byte2, intnumber, i;
    char bytechar1, bytechar2;
    double number;

    for (i = 0; i < slice; ++i)
    {
        intnumber = (int) output_vector[i]/225;

        /* map negative samples onto the upper half of the 16-bit range */
        if (intnumber < 0)
            intnumber = intnumber + 65536;

        /* split into low and high bytes; the low byte is written first */
        byte1 = (int) intnumber % 256;
        byte2 = (int) intnumber/256;
        bytechar1 = (char) byte1;
        bytechar2 = (char) byte2;

        putc(bytechar1, output_ptr);
        putc(bytechar2, output_ptr);
    }
}
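
/* A minimal sketch of the byte packing used above: a signed sample   */
/* is mapped onto 0..65535 and split into a low and a high byte, low  */
/* byte first. The name pack16_sketch is hypothetical, and the sketch */
/* assumes the sample already lies in -32768..32767; it does not      */
/* reproduce the scaling step or the file output.                     */

```c
/* Pack one signed 16-bit sample into bytes[0] (low) and bytes[1] (high). */
static void pack16_sketch(long sample, unsigned char bytes[2])
{
    long u = sample;

    if (u < 0)
        u += 65536L;            /* two's-complement wrap for negative values */

    bytes[0] = (unsigned char)(u % 256L);   /* low byte  */
    bytes[1] = (unsigned char)(u / 256L);   /* high byte */
}
```

/* For example, -1 packs to the two bytes 255, 255 and 258 packs to   */
/* 2, 1.                                                              */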